🧙🏽‍♂️

CS 285 Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley CS 285 (Sergey Levine).

Reference Notice: Material highly and mostly derived from Prof. Levine's lecture slides; some ideas were borrowed from Wikipedia and CS 189.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

General Introduction to RL (LEC 1)

Supervised ML

Given D=\{(x_i,y_i)\},

we will learn to predict y from x.

usually assumes:

i.i.d. data (previous x,y pair does not affect next x, y pair)

known ground truth outputs in the training

Problem:

Cannot adapt if something fails.

Reinforcement Learning:

Data is not i.i.d: previous outputs influence future inputs!

Ground truth answer is not known, only know if we succeeded or failed (but we generally know the reward)

Compared to traditional reinforcement learning,

With DRL(Deep Reinforcement Learning), we can solve complex problems end-to-end!

But there are challenges:

  1. Humans can learn incredibly quickly
    1. Deep RLs are usually slow
    1. probably because humans can reuse past knowledge
  1. Not clear what the reward function should be
  1. Not clear what the role of prediction should be

Fundamental Theorem of RL(Lec 4)

Notations + Markov Property

We are actually trying to learn a distribution over the action a given the observation o, where \theta is the parameter of the policy \pi. We add a subscript t to each term to denote the index in the time series, since in RL actions are usually correlated across time.

Sometimes instead of πθ(atot)\pi_\theta (a_t|o_t) we will see πθ(atst)\pi_\theta(a_t|s_t).

s_t \rightarrow \text{state} \\ \pi_\theta(a_t|s_t) \rightarrow \text{policy (fully observed)} \\ o_t \rightarrow \text{observation} \\ \pi_\theta(a_t|o_t) \rightarrow \text{policy}

Difference - can use observation to infer state, but sometimes (in this example, car in front of the cheetah) the observation does not give enough information to let us infer state.

State is the true configuration of the system and observation is something that results from the state which may or may not be enough to deduce the state
🔥
The state is usually assumed to be a Markovian state

The Markov property is a very important property ⇒ if you want to change the future, you only need to consider the current state and your current action

However, is your future state independent of your past observation? In general, observations do not satisfy Markov property

Many RL algorithms in this course require the Markov property

Side note on notation

| RL world (Richard Bellman) | Robotics (Lev Pontryagin) |
| --- | --- |
| s_t: state | x_t: state |
| a_t: action | u_t: action |
| r(s,a): reward | c(x,u): cost = -r(x,u) |

In general, we want to minimize:

\min_\theta \mathbb{E}_{s_{1:T}, a_{1:T}} [\sum_t c(s_t,a_t)]

We can write the function as either c or r, which stands for (minimizing) cost or (maximizing) reward.

🔥
Notice that we want to maximize the reward (or minimize the cost) under the expectation of the states that we are going to encounter in production / test environment, not training states.

Definition of Markov Chain

M=\{S,T\}
S \rightarrow \text{state space}, \quad s \in S \rightarrow \text{state (discrete or continuous)} \\ T \rightarrow \text{transition operator described by } \mathbb{P}(s_{t+1}|s_t)

The transition “operator”?

\text{let } \mu_{t,i} = \mathbb{P}(s_t = i), \text{ then } \vec{\mu}_t \text{ is a vector of probabilities;} \\ \text{let } T_{i,j} = \mathbb{P}(s_{t+1}=i|s_t=j), \text{ then } \vec{\mu}_{t+1} = T\vec{\mu}_t
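As a concrete sketch with a made-up 3-state chain, the transition operator acts on the probability vector exactly as above:

```python
import numpy as np

# Toy 3-state chain; column j of T holds P(s_{t+1} = i | s_t = j),
# so each column sums to 1.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.5],
    [0.0, 0.1, 0.5],
])
mu = np.array([1.0, 0.0, 0.0])              # start deterministically in state 0

mu_next = T @ mu                            # mu_{t+1} = T mu_t
mu_10 = np.linalg.matrix_power(T, 10) @ mu  # mu_{t+10} = T^10 mu_t
```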

Definition of Markov Decision Process

M = \{S,A,T,r\}
S \rightarrow \text{state space}, \quad s \in S \rightarrow \text{state (discrete or continuous)} \\ A \rightarrow \text{action space}, \quad a \in A \rightarrow \text{action (discrete or continuous)} \\ r: S \times A \rightarrow \mathbb{R} \rightarrow \text{reward function}
T \rightarrow \text{transition operator (now a tensor):} \\ \mu_{t,i} = \mathbb{P}(s_t=i), \quad \xi_{t,k} = \mathbb{P}(a_t=k), \quad T_{i,j,k} = \mathbb{P}(s_{t+1}=i | s_t=j,a_t=k) \\ \mu_{t+1,i} = \sum_{j,k}T_{i,j,k} \mu_{t,j} \xi_{t,k}
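The tensor update above can be sketched on a random, made-up MDP; `np.einsum` implements the sum over j and k directly (note this uses a state-independent action distribution \xi, as in the formula):

```python
import numpy as np

rng = np.random.default_rng(0)

n_s, n_a = 4, 2
# T[i, j, k] = P(s_{t+1} = i | s_t = j, a_t = k); normalize over the i axis.
T = rng.random((n_s, n_s, n_a))
T /= T.sum(axis=0, keepdims=True)

mu = np.full(n_s, 1.0 / n_s)    # mu_{t,j} = P(s_t = j)
xi = np.array([0.3, 0.7])       # xi_{t,k} = P(a_t = k)

# mu_{t+1, i} = sum_{j,k} T[i, j, k] * mu[j] * xi[k]
mu_next = np.einsum('ijk,j,k->i', T, mu, xi)
```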

Partially Observed Markov Decision Process

M = \{S,A,O,T,\xi,r\} \\ S \rightarrow \text{state space}, \quad s \in S \rightarrow \text{state (discrete or continuous)} \\ A \rightarrow \text{action space}, \quad a \in A \rightarrow \text{action (discrete or continuous)} \\ O \rightarrow \text{observation space}, \quad o \in O \rightarrow \text{observation (discrete or continuous)} \\ T \rightarrow \text{transition operator} \\ \xi \rightarrow \text{emission probability } \mathbb{P}(o_t|s_t) \\ r: S \times A \rightarrow \mathbb{R} \rightarrow \text{reward function}

Goal of RL (Finite Horizon / Episodic)

\underbrace{p_{\theta}(s_1,a_1,\dots,s_T,a_T)}_{p_\theta(\tau), \text{ where } \tau \text{ denotes the trajectory}}=\mathbb{P}(s_1)\prod_{t=1}^T \underbrace{\pi_{\theta}(a_t|s_t) \mathbb{P}(s_{t+1}|s_t,a_t)}_{\text{Markov chain on } (s,a)}

So our goal is to maximize the expectation of rewards

θ=arg maxθEτpθ[tr(st,at)]\theta^* = \argmax_\theta \mathbb{E}_{\tau \sim p_\theta}[\sum_t r(s_t,a_t)]
Group action and state together to form augmented states
Augmented states form a Markov Chain.

The markov chain can be described using:

[st+1at+1]=T[stat][st+kat+k]=Tk[stat]\begin{split} \begin{bmatrix} s_{t+1} \\ a_{t+1} \end{bmatrix} = T\begin{bmatrix} s_t \\ a_t \end{bmatrix} \\ \begin{bmatrix} s_{t+k} \\ a_{t+k} \end{bmatrix} = T^k \begin{bmatrix} s_t \\ a_t \end{bmatrix} \end{split}

So now we can modify the goal a bit towards the markov chain.

θ=arg maxθEτpθ(τ)[tr(st,at)]=arg maxθt=1TE(st,at)pθ(st,at)[r(st,at)]\begin{split} \theta^* &= \argmax_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t,a_t)] \\ &= \argmax_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \end{split}

Goal of RL (Infinite Horizon)

What if T = \infty? (Infinite horizon)

  1. We need some way to make objective finite
    1. Use average reward: \theta^* = \argmax_\theta \frac{1}{T}\mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]
    1. Use discounts

Q: Does p(s_t,a_t) converge to a stationary distribution as t \rightarrow \infty?

Ergodic Property: Every state can transition into another state with a non-zero probability.

“Stationary”: The same before and after transition

\underbrace{\mu = T \mu}_{\text{stationary: the distribution is the same before and after the transition}}
(TI)μ=0(T-I)\mu = 0

This means that μ\mu is an eigenvector of TT with eigenvalue of 1 (always exists under some regularity conditions)

μ=pθ(s,a)stationary distribution\mu = p_\theta(s,a) \rightarrow \text{stationary distribution}

As t goes to infinity, the sum of rewards becomes dominated by the stationary distribution (we have infinitely many terms drawn from \mu, versus a finite number of non-stationary state/action pairs).

\begin{split} \lim_{T \rightarrow \infty} \theta^* &= \lim_{T \rightarrow \infty} \argmax_\theta \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \\ &= \argmax_{\theta} \mathbb{E}_{(s,a) \sim p_\theta(s,a)}[r(s,a)] \end{split}
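A quick numerical sketch of the stationary-distribution claim on a made-up 2-state chain: the eigenvector of T with eigenvalue 1, normalized to sum to 1, satisfies \mu = T\mu:

```python
import numpy as np

T = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
])  # columns are P(next | current) and each sums to 1

vals, vecs = np.linalg.eig(T)
i = np.argmin(np.abs(vals - 1.0))  # eigenvalue 1 exists for a stochastic matrix
mu = np.real(vecs[:, i])
mu = mu / mu.sum()                 # normalize the eigenvector into a distribution
```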

Expectations and Stochastic Systems

RL is about maximizing the expectation of rewards. Expectations can be continuous in the parameters of the corresponding distributions even when the function we take the expectation of is itself highly discontinuous. Because of this property, we can use smooth optimization methods like gradient descent.
📌
Because the probability distribution over rewards usually has a continuous PDF.

High-level anatomy of reinforcement learning algorithms

Which parts are expensive?

Value Function & Q Function

Remember that RL problems are framed as argmax of expectation of rewards, but how do we actually deal with the expectations?

📌
Write it out recursively!

We can write this expectation out explicitly as a nested expectation (looks like a DP algorithm!)

\mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_{t=1}^T r(s_t,a_t)] = \\ \mathbb{E}_{s_1 \sim p(s_1)}[\mathbb{E}_{a_1 \sim \pi(a_1|s_1)}[r(s_1,a_1)+\mathbb{E}_{s_2 \sim p(s_2|s_1,a_1)}[\mathbb{E}_{a_2 \sim \pi(a_2|s_2)}[r(s_2,a_2)+\cdots|s_2]|s_1,a_1]|s_1]]
It looks messy, but what if we had a function that told us the value inside \mathbb{E}_{a_1 \sim \pi(a_1|s_1)}? Specifically, the value r(s_1,a_1)+\mathbb{E}_{s_2 \sim p(s_2|s_1,a_1)}[\mathbb{E}_{a_2 \sim \pi(a_2|s_2)}[r(s_2,a_2)+\cdots|s_2]|s_1,a_1]. We will define it as our Q function:
Q(s_1,a_1) = r(s_1,a_1)+\mathbb{E}_{s_2 \sim p(s_2|s_1,a_1)}[\mathbb{E}_{a_2 \sim \pi(a_2|s_2)}[r(s_2,a_2)+\cdots|s_2]|s_1,a_1]

Then our equation is simplified into

\mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_{t=1}^T r(s_t,a_t)] = \mathbb{E}_{s_1 \sim p(s_1)}[\mathbb{E}_{a_1 \sim \pi(a_1|s_1)}[Q(s_1,a_1)|s_1]]

It’s a lot easier to modify the parameter θ\theta of πθ(a1s1)\pi_\theta(a_1|s_1) if Q(s1,a1)Q(s_1,a_1) is known


Q-Function Definition

Qπ(st,at)=t=tTEπθ[r(st,at)st,at]total reward from taking at in stQ^{\pi}(s_t,a_t)=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t] \rightarrow \text{total reward from taking $a_t$ in $s_t$}

Value Function Definition

Vπ(st)=t=tTEπθ[r(st,at)st]total reward from stV^{\pi}(s_t)=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t] \rightarrow \text{total reward from $s_t$}
Vπ(st)=Eatπ(atst)[Qπ(st,at)]V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(a_t|s_t)}[Q^{\pi}(s_t,a_t)]

Now \mathbb{E}_{s_1 \sim p(s_1)}[V^{\pi}(s_1)] is the RL objective.

To use those two functions:

If we have policy π\pi, and we know Qπ(s,a)Q^{\pi}(s,a), then we can improve π\pi:

set π(as)=1 if a=arg maxaQπ(s,a)\text{set } \pi^{'}(a|s)=1 \text{ if } a = \argmax_a Q^{\pi}(s,a)

(This policy is at least as good as π\pi and probably better)

OR

Compute gradient to increase the probability of good actions aa if Qπ(s,a)>Vπ(s)Q^{\pi}(s,a) > V^{\pi}(s), then aa is better than average. Recall that Vπ(s)=E[Qπ(s,a)]V^{\pi}(s) = \mathbb{E}[Q^{\pi}(s,a)] under π(as)\pi(a|s).

Modify π(as)\pi(a|s) to increase the probability of a if Qπ(s,a)>Vπ(s)Q^{\pi}(s,a) > V^{\pi}(s)
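Both improvement ideas can be illustrated on a toy tabular Q^\pi (the numbers below are made up): compute V^\pi as the expectation of Q^\pi under \pi, then switch to the greedy policy:

```python
import numpy as np

# Made-up Q^pi table: 3 states x 2 actions, under some current policy pi.
Q = np.array([
    [1.0, 2.0],
    [0.5, 0.1],
    [0.0, 3.0],
])
pi = np.full((3, 2), 0.5)          # current policy pi(a|s): uniform over actions

V = (pi * Q).sum(axis=1)           # V^pi(s) = E_{a ~ pi(a|s)}[Q^pi(s, a)]

# Greedy improvement: pi'(a|s) = 1 for a = argmax_a Q^pi(s, a), else 0.
pi_new = np.zeros_like(pi)
pi_new[np.arange(3), Q.argmax(axis=1)] = 1.0
V_new = (pi_new * Q).sum(axis=1)
```

V_new >= V holds elementwise, matching the claim that the greedy policy is at least as good as \pi under Q^\pi.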

Bellman Operator

First defined in fixed Q-Iteration

Information Theory (LEC 14)

Some useful identities:

  1. p(x)p(x) - distribution (over observations x)
  1. H(p(x))=Exp(x)[logp(x)]H(p(x))= -\mathbb{E}_{x \sim p(x)}[\log p(x)] ⇒ entropy (how “broad” p(x) is)
  1. I(x;y)=DKL(p(x,y)p(x)p(y))=E(x,y)p(x,y)[logp(x,y)p(x)p(y)]=H(p(y))H(p(yx))I(x;y) = D_{KL}(p(x,y)||p(x)p(y)) = \mathbb{E}_{(x,y) \sim p(x,y)}[\log \frac{p(x,y)}{p(x) p(y)}] = H(p(y)) - H(p(y|x))
    1. mutual information between two variables
    1. reduction in entropy after knowing xx
    1. See “empowerment” (Polani et al.) for a typical use
      1. I(st+1;at)=H(st+1)H(st+1at)I(s_{t+1}; a_t) = H(s_{t+1}) - H(s_{t+1}|a_t)
      1. Can be viewed as quantifying “control authority” in an information-theoretic way
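A small sketch of these identities on a made-up joint distribution, checking that the KL form of I(x;y) matches H(p(y)) - H(p(y|x)):

```python
import numpy as np

# Made-up joint distribution p(x, y) over two binary variables (rows: x, cols: y).
p_xy = np.array([
    [0.4, 0.1],
    [0.1, 0.4],
])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# KL form: I(x; y) = E_{p(x,y)}[log p(x,y) / (p(x) p(y))]
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# Entropy form: H(p(y)) - H(p(y|x))
H_y = -np.sum(p_y * np.log(p_y))
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))
```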

Optimal Control and Planning Algorithms

🔥
In model-based learning, we need an optimal policy extraction step to be able to extract a policy from the learned model of the environment.

Model-free methods completely discard the transition dynamics, working only with sampled data points

  1. But we often do know the dynamics
    1. Games
    1. Easily Modeled Systems
    1. Simulated Environments
  1. Or we can learn the dynamics
    1. System identification (fit unknown parameters of a known model)
    1. Learning (fit a general-purpose model to observed transition data)
Knowing the dynamics makes things easier

But how do we extract policies from a known dynamics?

Objective of optimal control:

mina1,,aTE(c(s1,s2,,sT)a1,,aT)\min_{a_1, \dots, a_T} \mathbb{E}(c(s_1,s_2,\dots,s_T)|a_1,\dots,a_T)

Open-loop Planning

a1,,aT=arg maxa1,,aTE[tr(st,at)a1,,aT]a_1,\dots,a_T = \argmax_{a_1,\dots,a_T} \mathbb{E}[\sum_t r(s_t,a_t)|a_1,\dots,a_T]

This is sub-optimal because it does not account for the fact that, after executing some actions, we will receive additional information we could exploit.

Stochastic Optimization

Abstracts away optimal control/planning ⇒ just optimizes a function

a1,,aT=arg maxa1,,aTJ(a1,,aT)A=arg maxAJ(A)a_1,\dots,a_T = \argmax_{a_1,\dots,a_T} J(a_1,\dots,a_T) \rightarrow A = \argmax_{A} J(A)

Simplest Method: Guess and Check (”Random Shooting” method)

  1. pick A1,,ANA_1, \dots, A_N from some distribution (e.g. uniform)
  1. Choose AiA_i based on arg maxiJ(Ai)\argmax_i J(A_i)
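A minimal sketch of random shooting, where J is a stand-in quadratic objective rather than a real rollout:

```python
import numpy as np

rng = np.random.default_rng(0)

def J(A):
    """Stand-in objective: a real version would roll out the action sequence A
    through the dynamics and sum rewards. Here: a quadratic with optimum at 0."""
    return -np.sum(A ** 2)

T, N = 5, 1000
candidates = rng.uniform(-1.0, 1.0, size=(N, T))   # 1. sample A_1..A_N uniformly
scores = np.array([J(A) for A in candidates])      # evaluate each sequence
A_best = candidates[np.argmax(scores)]             # 2. pick argmax_i J(A_i)
```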

Cross-Entropy Method (CEM)

Iterative improvement of the “random shooting method”

  1. sample A1,,ANA_{1}, \dots, A_{N} from p(A)p(A) (typically start from Gaussian distribution)
  1. Evaluate J(A1),,J(AN)J(A_1), \dots, J(A_N)
  1. Pick the elites Ai1,,AiMA_{i_1}, \dots, A_{i_M} with the highest value (M<NM < N)
  1. Refit p(A)p(A) to the elites Ai1,,AiMA_{i_1}, \dots, A_{i_M}
  1. Repeat this process
See also CMA-ES ⇒ CEM with momentum

  1. Very fast if parallelized
  1. Extremely Simple
  1. But very harsh dimensionality limit
  1. Only open-loop planning
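The CEM loop above can be sketched as follows, again with a stand-in quadratic J instead of a real rollout:

```python
import numpy as np

rng = np.random.default_rng(0)

def J(A):
    """Stand-in objective with optimum at A = 0 (a real J would do a rollout)."""
    return -np.sum(A ** 2)

T, N, M = 5, 200, 20                    # horizon, samples per iteration, elites
mean, std = np.zeros(T), np.ones(T)     # p(A): start from a broad Gaussian

for _ in range(10):
    A = mean + std * rng.standard_normal((N, T))     # 1. sample from p(A)
    scores = np.array([J(a) for a in A])             # 2. evaluate J(A_i)
    elites = A[np.argsort(scores)[-M:]]              # 3. keep the M best
    mean = elites.mean(axis=0)                       # 4. refit p(A) to the elites
    std = elites.std(axis=0) + 1e-6
```

Each iteration shrinks p(A) around the elites, so the mean converges toward the optimum.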

Closed-loop Planning

Monte-Carlo Tree Search (MCTS)

Discrete planning can be viewed as a tree search problem, but with vanilla tree search the complexity is exponential in the number of time steps

So can we restrict the number of searches without expanding the whole tree?

The idea is to stop at depth xx and with the state, approximate the outcomes that come after by using a baseline policy.

Which path do we search first?

Choose nodes with best reward, but also prefer rarely visited nodes

So the algorithm:

  1. Find a leaf sls_l using TreePolicy(s1)\text{TreePolicy}(s_1)
  1. Evaluate the leaf using DefaultPolicy(sl)\text{DefaultPolicy}(s_l)
    1. Start from the root and execute every action in the tree to arrive at the leaf / a new leaf(due to randomness)
  1. Update all values in tree between s1s_1 and sls_l
  1. Repeat this iteration and then choose best action from s1s_1

UCT TreePolicy(s_t)

  1. If s_t is not fully expanded, choose a new a_t; otherwise choose the child with the best Score(s_{t+1})
Score(s_t)=\frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2 \ln N(s_{t-1})}{N(s_t)}}
Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
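The score above can be written as a small helper, where C is a tunable exploration constant and Q, N are the accumulated value and visit counts of a node:

```python
import math

def uct_score(q_child, n_child, n_parent, c=1.0):
    """Score = Q(s_t)/N(s_t) + 2C * sqrt(2 ln N(s_{t-1}) / N(s_t)).
    Unvisited children get an infinite score so they are expanded first."""
    if n_child == 0:
        return math.inf
    return q_child / n_child + 2.0 * c * math.sqrt(2.0 * math.log(n_parent) / n_child)
```

The tree policy descends by picking, at each node, the child with the highest score; the bonus term shrinks as a child accumulates visits, trading exploration for exploitation.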

Shooting Methods vs. Collocation

  1. Shooting method: optimize over actions only
    1. Earlier actions have larger effects on the trajectory
  1. Collocation method: Optimize over actions and states, with constraints

Trajectory Optimization

Because this is a trajectory optimization lecture, we will use xtx_t for state and utu_t for action

If we want to use gradients, it usually helps to use a second-order method rather than plain first-order backpropagation through the trajectory!

Linear Case: Linear Quadratic Regulator (LQR)

Objective:

minu1,,uTc(x1,u1)+c(f(x1,u1),u2)++c(f(f()),uT)\min_{u_1,\dots,u_T} c(x_1,u_1)+c(f(x_1,u_1),u_2)+\cdots+c(f(f(\dots)\dots),u_T)

We will assume that

f(xt,ut)=Ft[xtut]+ftf(x_t,u_t)=F_t\begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t
c(xt,ut)=12[xtut]Ct[xtut]+[xtut]ctc(x_t,u_t)=\frac{1}{2} \begin{bmatrix} x_t &u_t \end{bmatrix} C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t &u_t \end{bmatrix} c_t

Base case: Solve for uTu_T only

Q(xT,uT)=const+12[xTuT]CT[xTuT]+[xTuT]cTQ(x_T,u_T)=\text{const}+\frac{1}{2}\begin{bmatrix} x_T \\ u_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^\top c_T

We can break CTC_T and cTc_T into pieces that affects uu and affects xx

(Note that here we assumed C_{u_T, x_T} = C_{x_T,u_T}^\top)

CT=[CxT,XTCxT,uTCuT,xTCuT,uT]C_T = \begin{bmatrix} C_{x_T,X_T} &C_{x_T,u_T} \\ C_{u_T,x_T} &C_{u_T,u_T} \end{bmatrix}
cT=[cxTcuT]c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix}
uTQ(xT,uT)=CuT,xTxT+CuT,uTuT+cuT=0uT=CuT,uT1(CuT,xTxT+cuT)=KTxT+kT\nabla_{u_T} Q(x_T,u_T) = C_{u_T,x_T} x_T + C_{u_T,u_T} u_T + c_{u_T}^{\top} = 0 \\ u_T = -C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T}) = K_Tx_T+k_T

Since u_T is fully determined by x_T, we can write our cost function as

V(xT)=const+12[xTKTxT+kT]CT[xTKTxT+kT]+[xTKTxT+kT]cT=const+12xTVTxT+xTvT\begin{split} V(x_T) &= \text{const} + \frac{1}{2} \begin{bmatrix} x_T \\ K_Tx_T+k_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ K_Tx_T+k_T \end{bmatrix} + \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^\top c_T \\ &=\text{const} + \frac{1}{2} x_T^\top V_T x_T + x_T^\top v_T \end{split}

where

V_T = C_{x_T, x_T}+C_{x_T, u_T}K_T + K_T^\top C_{u_T,x_T} + K_T^\top C_{u_T,u_T}K_T \\ v_T = c_{x_T}+C_{x_T,u_T}k_T+K_T^\top c_{u_T} + K_T^\top C_{u_T, u_T}k_T

So what if we solve for uT1u_{T-1}? ⇒ uT1u_{T-1} affects xTx_T!

f(xT1,uT1)=xT=FT1[xT1uT1]+fT1f(x_{T-1},u_{T-1})=x_T = F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + f_{T-1}
Q(xT1,uT1)=const+12[xT1uT1]CT1[xT1uT1]+[xT1uT1]cT1+V(f(xT1,uT1))Q(x_{T-1},u_{T-1})=\text{const}+\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top C_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^{\top} c_{T-1} + V(f(x_{T-1},u_{T-1}))
V(xT)=const+12[xT1uT1]FT1VTFT1[xT1uT1]+[xT1uT1]FT1VTfT1+[xT1uT1]FT1vTV(x_T)=\text{const} + \frac{1}{2} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T F_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T f_{T-1} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top v_T

Combining the two equations

Q(xT1,uT1)=const+12[xT1uT1]QT1[xT1uT1]+[xT1uT1]qT1Q(x_{T-1},u_{T-1})=\text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top Q_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top q_{T-1}
QT1=CT1+FT1VTFT1qT1=cT1+FT1VTfT1+FT1vTQ_{T-1} = C_{T-1} + F_{T-1}^\top V_T F_{T-1} \\ q_{T-1} = c_{T-1} + F_{T-1}^\top V_T f_{T-1} + F_{T-1}^\top v_T

Now we can derive the derivative

uT1Q(xT1,uT1)=QuT1,xT1xT1+QuT1,uT1uT1+quT1=0uT1=KT1xT1+kT1\nabla_{u_{T-1}} Q(x_{T-1}, u_{T-1}) = Q_{u_{T-1},x_{T-1}}x_{T-1} + Q_{u_{T-1},u_{T-1}} u_{T-1} + q_{u_{T-1}}^\top = 0 \\ u_{T-1} = K_{T-1}x_{T-1} + k_{T-1}

Where

KT1=QuT1,uT11QuT1,xT1kT1=QuT1,uT11quT1K_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} Q_{u_{T-1},x_{T-1}} \\ k_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} q_{u_{T-1}}

So looks like we can start from the last timestep and do backward recursion!

After doing backward recursion, we can compute our utu_t and xtx_t by forward recursion
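The backward/forward recursion can be sketched in a few lines; this follows the derivation above (K_t = -Q_{uu}^{-1}Q_{ux}, k_t = -Q_{uu}^{-1}q_u), assuming no extra terminal cost beyond step T:

```python
import numpy as np

def lqr(F, f, C, c, T, x1):
    """Backward recursion for the gains K_t, k_t, then a forward rollout.
    Dynamics: x_{t+1} = F_t [x_t; u_t] + f_t.
    Cost: 0.5 [x; u]^T C_t [x; u] + [x; u]^T c_t. F, f, C, c are length-T lists."""
    n = x1.shape[0]
    V, v = np.zeros((n, n)), np.zeros(n)   # value beyond the horizon is zero
    Ks, ks = [None] * T, [None] * T
    for t in reversed(range(T)):
        # Q_t = C_t + F_t^T V_{t+1} F_t,  q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1}
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Qxx, Qxu, Qux, Quu = Q[:n, :n], Q[:n, n:], Q[n:, :n], Q[n:, n:]
        qx, qu = q[:n], q[n:]
        K = -np.linalg.solve(Quu, Qux)     # K_t = -Q_uu^{-1} Q_ux
        k = -np.linalg.solve(Quu, qu)      # k_t = -Q_uu^{-1} q_u
        # Plug u_t = K_t x_t + k_t back in to get V_t, v_t
        V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K
        v = qx + Qxu @ k + K.T @ qu + K.T @ Quu @ k
        Ks[t], ks[t] = K, k
    # Forward recursion: roll out u_t = K_t x_t + k_t through the dynamics.
    xs, us, x = [], [], x1
    for t in range(T):
        u = Ks[t] @ x + ks[t]
        xs.append(x)
        us.append(u)
        x = F[t] @ np.concatenate([x, u]) + f[t]
    return xs, us
```

On a toy 1-D system x_{t+1} = x_t + u_t with cost x^2 + u^2, this drives the state toward zero.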

LQR for Stochastic and Nonlinear Systems

Stochastic dynamics:

If

p(xt+1xt,ut)=N(Ft[xtut]+ft,Σt)p(x_{t+1}|x_t,u_t) = N(F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t, \Sigma_t )

Then just choose actions according to the linear case (the Gaussian noise is symmetric, so the same controller is optimal; this is a certainty-equivalence result)

Nonlinear case - Differentiable Dynamic Programming (DDP) / Iterative LQR (ILQR)

Can we approximate a nonlinear system as a linear-quadratic system? (nearby)
f(xt,ut)f(x^t,u^t)+xt,utf(x^t,u^t)[xtx^tutu^t]f(x_t,u_t) \approx f(\hat{x}_t,\hat{u}_t)+\nabla_{x_t,u_t} f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
c(x_t,u_t) \approx c(\hat{x}_t, \hat{u}_t)+\nabla_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)^\top \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}+\frac{1}{2} \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^\top \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}

We will denote the change in xt,utx_t, u_t as δxt,δut\delta x_t, \delta u_t

fˉ(δxt,δut)=Ft[δxtδut]\bar{f}(\delta x_t, \delta u_t) = F_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}
cˉ(δxt,δut)=12[δxtδut]Ct[δxtδut]+[δxtδut]ct\bar{c}(\delta x_t, \delta u_t) = \frac{1}{2} \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top C_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top c_t

Where Ft=xt,utf(x^t,u^t)F_t = \nabla_{x_t,u_t} f(\hat{x}_t, \hat{u}_t), Ct=xt,ut2c(x^t,u^t)C_t = \nabla_{x_t,u_t}^2 c(\hat{x}_t, \hat{u}_t), ct=xt,utc(x^t,u^t)c_t = \nabla_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)

Now we can use LQR with dynamics fˉ\bar{f}, cost cˉ\bar{c} to find δxt\delta x_t and δut\delta u_t

iLQR is basically an approximation of Newton's method for solving the control objective.

But we used a first-order dynamics approximation here; to get the full Newton's method, we need a second-order dynamics approximation (this is DDP).

🔥
But this may be a bad idea, because we always jump to the minimum of the approximation, and the approximation is only valid within a short range. We can add a scaling constant \alpha on the k_t term in the forward pass ⇒ this controls how far we move away from our starting trajectory (the smaller \alpha is, the closer we stay to the original trajectory).

Additional Reading

  1. Mayne, Jacobson. (1970). Differential dynamic programming.
    1. Original differential dynamic programming algorithm.
  1. Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
    1. Practical guide for implementing non-linear iterative LQR.
  1. Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
    1. Probabilistic formulation and trust region alternative to deterministic line search.

Types of RL Algorithms

Remember the objective of RL:

θ=arg maxθEτpθ(τ)[tr(st,at)]\theta^{*} = \argmax_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]
  1. Policy Gradient
    1. Directly differentiate the above objective
  1. Value-based
    1. Estimate value function or Q-function of the optimal policy (no explicit policy)
    1. Then use those functions to improve the policy
  1. Actor-critic (A mix between policy gradient and value-based)
    1. Estimate value function or Q-function of the current policy, use it to improve policy
  1. Model-based RL
    1. Estimate the transition model, and then
      1. Use it for planning (no explicit policy)
      1. Use it to improve a policy
      1. Other variants

Supervised Learning of Behaviors(LEC 2)

Imitation Learning

⚠️
Does it work? In theory no: small mistakes compound and the policy drifts away from the training distribution.

In practice, after a long time of training (with enough data), it actually worked.

“tightrope walker scenario”: the policy has to follow the exact same behavior as in training, otherwise the model does not know what to do

Three camera angles, each supervised to steer differently! This trick helped a lot.

In effect, the training data now contains scenarios that teach the machine how to correct its small mistakes.

Instead of being clever about pπθ(ot)p_{\pi_\theta}(o_t), let’s be clever about pdata(ot)p_{data}(o_t) ⇒ Do DAgger(Data Aggregation)

DAgger (Data Aggregation)

🔥
Addresses “Distributional Drift”, adds on-policy data to the model

Idea: To collect data from pπθ(ot)p_{\pi_\theta}(o_t) instead of pdata(ot)p_{data}(o_t)

How: Just run πθ(ot)\pi_\theta(o_t) but we need labels ata_t

  1. train πθ(atot)\pi_\theta(a_t|o_t) on human dataset D={o1,a1,,oN,aN}D = \{o_1,a_1,\dots,o_N,a_N\}
  1. run πθ(atot)\pi_\theta(a_t|o_t) to obtain Dπ={o1,,oM}D_\pi = \{o_1,\dots,o_M\}
  1. Ask humans to label DπD_\pi with actions ata_t
  1. Aggregate DDDπD \leftarrow D \cup D_\pi
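The four steps above can be sketched as a toy loop, where `expert`, `fit`, and `run_policy` are hypothetical stand-ins (a real version would fit a neural network and roll the policy out in the environment):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(o):
    """Hypothetical expert labeler (step 3): here just a fixed rule on a 1-D observation."""
    return 1 if o > 0.0 else 0

def fit(D):
    """Stand-in for supervised training (step 1): a 1-nearest-neighbor 'policy' over D."""
    obs = np.array([o for o, _ in D])
    acts = np.array([a for _, a in D])
    return lambda o: acts[np.argmin(np.abs(obs - o))]

def run_policy(pi, n=20):
    """Step 2: collect the observations the policy visits. A real version would
    roll pi out in the environment; here we just sample a wider observation range."""
    return list(rng.uniform(-2.0, 2.0, size=n))

D = [(o, expert(o)) for o in rng.uniform(-1.0, 1.0, size=10)]  # initial human data
for _ in range(5):
    pi = fit(D)                          # 1. train on D
    D_pi = run_policy(pi)                # 2. run the policy
    D += [(o, expert(o)) for o in D_pi]  # 3. + 4. expert labels, then aggregate
pi = fit(D)
```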

But:

  1. Lots of the problem is in step 3
    1. We as humans have to watch for feedback and then give optimal actions, cannot just watch a state and give output
  1. What if the model does NOT drift?
  1. Need to mimic expert behavior very accurately

Why might we fail to match the expert?

Figure 1
Figure 2
  1. Non-Markovian Behavior
    1. Unnatural for humans to do perfect Markovian actions: Humans more like πθ(ato1,,ot)\pi_\theta (a_t|o_1, \dots, o_t) ⇒ based on previous observations
    1. Solve this by adding RNN / LSTM cells to the network (Fig. 1)
      1. May still work poorly (Figure 2)
  1. Multimodal Behavior (Multiple Modes/Peaks in real distribution)
    1. If discrete actions, then this is not an issue
      1. But continuous requires exponential discrete bins (for softmax)
    1. Solved by
      1. Output mixture of Gaussians
        1. π(ao)=iwiN(μi,Σi)\pi(a|o)=\sum_i w_i N(\mu_i,\Sigma_i)
        1. Tradeoffs:
          1. need more output parameters
          1. The ability to model with this in high dimensions is challenging (in theory the number of mixture elements rises exponentially with dimension increase)
      1. Latent variable models
        1. The output is still Gaussian
        1. In addition to inputting an image into NN, we also input a latent variable(may be drawn from a prior distribution) into the model.
          1. Conditional Variational Autoencoders (VAEs)
          1. Normalizing Flow/realNVP
          1. Stein Variational Gradient Descent
        1. Complex to train
      1. Autoregressive discretization
        1. discretizes one dimension at a time using an NN trick ⇒ never incur the exponential cost
        1. Adds a softmax for one dimension, does a discrete sampling (and obtains a dim 1 value), feed the dim 1 value into another NN and softmax layer, do sampling, obtain dim 2 value, so on.
        1. Have to pick bins

Autoregressive discretization

Questions:

  1. Does including history info (LSTM/RNN) mitigate causal confusion?
    1. My guess is no since history has no correlation with confusion
  1. Can DAgger mitigate causal confusion?
    1. My guess is yes, since the parts of the data the model is confused about will then be manually labeled.

What's wrong with imitation learning:

  1. Humans need to provide data, which is limited
    1. DL works best when data is plentiful
  1. Humans are not good at providing some actions
  1. Humans can learn autonomously, can we do the same on machines?
    1. Unlimited data from own experience
    1. continuous self-improvement

Analysis of Imitation Learning

Ross et al. “Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”

We assume:

\begin{split} & c(s,a) = \begin{cases} 0 & \text{if } a = \pi^*(s) \\ 1 & \text{otherwise} \end{cases}\\ & \pi_\theta(a \ne \pi^*(s)|s) \le \epsilon \text{ for } s \sim p_{train}(s) \quad \leftarrow \text{the model is doing well on the training set} \\ & \text{it is actually enough to have } \mathbb{E}_{p_{train}(s)}[\pi_\theta(a \ne \pi^*(s)|s)] \le \epsilon \end{split}

then…

with DAgger

p_{train}(s) \rightarrow p_\theta (s) \\ \mathbb{E}[\sum_t c(s_t,a_t)] \le \epsilon T

Without DAgger

if ptrain(s)pθ(s):pθ(st)=(1ϵ)tundefinedprobability that we made no mistakesptrain(st)+(1(1ϵ)t)pmistake(st)\begin{split} &\text{if } p_{train}(s) \ne p_\theta(s): \\ & p_\theta(s_t)= \underbrace{(1-\epsilon)^t}_{\mathclap{\text{probability that we made no mistakes}}} p_{train}(s_t) + (1-(1-\epsilon)^t) p_{mistake}(s_t) \\ \end{split}

We cannot assume anything for pmistakep_{mistake}

But we can bound the total variation distance between the training and test distributions

pθ(st)ptrain(st)=(1(1ϵ)t)pmistake(st)ptrain(st)undefinedsubstitute in the above equation in for pθ(st)(1(1ϵ)t)×2| p_{\theta}(s_t)-p_{train}(s_t)| = \underbrace{(1-(1-\epsilon)^t)|p_{mistake}(s_t)-p_{train}(s_t)|}_{\mathclap{\text{substitute in the above equation in for $p_\theta(s_t)$}}} \le (1-(1-\epsilon)^t) \times 2

Since (1-\epsilon)^t \ge 1-\epsilon t for \epsilon \in [0,1],

| p_{\theta}(s_t)-p_{train}(s_t)| \le 2(1-(1-\epsilon)^t) \le 2\epsilon t

So finally

Epθ(st)[tc(st,at)]=tstpθ(st)ct(st)tstptrain(st)ct(st)+pθ(st)ptrain(st)cmaxtϵ+2ϵtϵT+2ϵT2O(ϵT2)\begin{split} \mathbb{E}_{p_\theta(s_t)}[\sum_t c(s_t,a_t)] &= \sum_t \sum_{s_t} p_\theta (s_t) c_t(s_t) \\ &\le \sum_t \sum_{s_t} p_{train}(s_t) c_t(s_t) + |p_\theta(s_t) - p_{train}(s_t)| c_{max} \\ &\le \sum_t \epsilon + 2 \epsilon t \\ &\le \epsilon T + 2 \epsilon T^2 \\ & \in O(\epsilon T^2) \end{split}

Another way to imitate (Goal-Conditioned Behavioral Cloning)

After clarification from class:

  1. Sometimes we have bad “demonstrations” that may lead to different end states
  1. Those demonstrations may be close enough to each other that we can train a “shared policy” for these different end states (in the sense that most of the earlier states can be handled by shared weights)
  1. During production / test, specify the end state you want to achieve and then we can predict πθ(as,g)\pi_\theta(a|s,g)

“Learning to Reach Goals via Iterated Supervised Learning” - Ghosh, Gupta, et al.

Model-Based RL

e.g.

  1. Dyna
  1. Guided Policy Search

Generate samples(run the policy) ⇒

Fit a model of p(st+1st,at)p(s_{t+1}|s_t,a_t)

Then improve the policy(a few options)

Improving policy:

  1. Just use the model to plan (no policy)
    1. Trajectory optimization / optimal control (primarily in continuous spaces)
      1. Backprop to optimize over actions
    1. Discrete planning in discrete action spaces
      1. Monte Carlo tree search
  1. Backprop gradients into the policy
    1. Requires some tricks to make it work
  1. Use the model to learn a separate value function or Q function
    1. Dynamic Programming
    1. Generate simulated experience for model-free learner

Value function based algorithms

e.g.

  1. Q-Learning, DQN
  1. Temporal Difference Learning
  1. Fitted Value Iteration

Generate samples(run the policy) ⇒

Fit a model of V(s)V(s) or Q(s,a)Q(s,a)

Then improve the policy (set \pi(s) = \argmax_a Q(s,a))

Direct Policy Gradients

e.g.

  1. REINFORCE, Natural Policy Gradient
  1. Trust Region Policy Optimization

Generate samples (run the policy) ⇒

Estimate the return

Rτ=tr(st,at)R_\tau = \sum_t r(s_t,a_t)

Improve the policy by

θθ+αθE[tr(st,at)]\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}[\sum_t r(s_t,a_t)]
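A minimal sketch of this update on a one-step (bandit) problem with a softmax policy, using the score-function (REINFORCE) estimator \nabla_\theta \mathbb{E}[r] = \mathbb{E}[\nabla_\theta \log \pi_\theta(a) \, r]; the bandit and its rewards are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step "bandit" with two actions; arm 1 pays reward 1, arm 0 pays 0.
def reward(a):
    return float(a == 1)

theta = np.zeros(2)   # softmax policy parameters
alpha = 0.1

for _ in range(500):
    probs = np.exp(theta) / np.exp(theta).sum()
    grad = np.zeros(2)
    for _ in range(10):                 # sample a small batch of trajectories
        a = rng.choice(2, p=probs)
        glogp = -probs.copy()           # grad log pi(a) for a softmax policy:
        glogp[a] += 1.0                 # one_hot(a) - probs
        grad += glogp * reward(a)       # score-function estimator
    theta += alpha * grad / 10          # theta <- theta + alpha * grad J

probs = np.exp(theta) / np.exp(theta).sum()
```

After training, the policy concentrates its probability on the rewarded arm.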

Actor-critic

e.g.

  1. Asynchronous Advantage Actor-Critic (A3C)
  1. Soft actor-critic (SAC)

Generate samples ⇒

Fit a model V(s)V(s) or Q(s,a)Q(s,a)

Improve the policy

θθ+αθE[tr(st,at)]\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}[\sum_t r(s_t,a_t)]

Tradeoff between algorithms

Sample efficiency

“How many samples for a good policy?”
  1. Most important question: Is the algorithm off-policy?
    1. Off policy means being able to improve the policy without generating new samples from that policy
    1. On policy means we need to generate new samples even if the policy is changed a little bit
⚠️
We may still want to use a less efficient algorithm because of the other tradeoffs and maybe the price of generating new samples is really cheap.

Stability & Ease of use

  1. Does our policy converge, and if it does, to what?
🔥
Supervised Learning: almost always gradient descent, which very likely converges. Reinforcement Learning: often not gradient descent, so convergence is much less certain.
  1. Value function fitting:
    1. Fixed point iteration
    1. At best, minimizes error of fit (“Bellman error”)
      1. Not the same as expected reward
    1. At worst, doesn’t optimize anything
      1. Many popular DRL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
  1. Model-based RL
    1. Model is not optimized for expected reward
    1. Model minimizes error of fit
      1. This will converge
    1. No guarantee that better model = better policy
  1. Policy Gradient
    1. The only one that performs Gradient Descent / Ascent on the true objective

Different assumptions

  1. Stochastic or deterministic environments?
  1. Continuous or discrete (states and action)?
  1. Episodic(finite TT) or infinite horizon?
  1. Different things are easy or hard in different settings
    1. Easier to represent the policy?
    1. Easier to represent the model?

Common Assumptions:

  1. Full observability
    1. Generally assumed by value function fitting methods
    1. Can be mitigated by adding recurrence
  1. Episodic Learning
    1. Often assumed by pure policy gradient methods
    1. Assumed by some model-based RL methods
    1. Other methods do not strictly require it, but tend to work better under this assumption
  1. Continuity or smoothness
    1. Assumed by some continuous value function learning methods
    1. Often assumed by some model-based RL methods

Model-Free Algorithms

Policy Gradients

”REINFORCE algorithm”

Remember: Direct gradient descent on RL objective

θ=arg maxθEτpθ(τ)[tr(st,at)]undefinedJ(θ)=arg maxθt=1TE(st,at)pθ(st,at)[r(st,at)]\begin{split} \theta^* &= \argmax_\theta \underbrace{ \mathbb{E}_{\tau \sim p_\theta(\tau)} [\sum_t r(s_t,a_t)] }_{\mathclap{J(\theta)}} \\ &= \argmax_{\theta} \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \end{split}

Note: We don’t know anything about the p(s1)p(s_1) and p(si+1ai,si)p(s_{i+1}|a_i, s_i), but we can approximate it.

J(θ)=Eτpθ(τ)[tr(st,at)undefineddenote as r(τ)]=pθ(τ)r(τ)dτ1Nitr(si,t,ai,tundefinedReward of time step t in the i-th sample)\begin{split} J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}[\underbrace{\sum_t r(s_t,a_t)}_{\mathclap{\text{denote as }r(\tau)}}] \\ &= \int p_\theta(\tau)r(\tau)d\tau \\ &\approx \frac{1}{N} \sum_i \sum_t \underbrace{r(s_{i,t}, a_{i,t}}_{\mathclap{\text{Reward of time step $t$ in the $i$-th sample}}}) \end{split}
The larger our sample size N is, the more accurate our estimate will be
θJ(θ)=θpθ(τ)r(τ)dτ\nabla_\theta J(\theta)=\int \nabla_\theta p_\theta(\tau)r(\tau) d\tau

We will use a convenient identity to rewrite this equation so that it can be evaluated without knowing the distribution for pθ(τ)p_\theta(\tau)

pθ(τ)logpθ(τ)=pθ(τ)θpθ(τ)pθ(τ)=θpθ(τ)p_\theta(\tau)\nabla \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)

Using this identity, we can rewrite the equation

θJ(θ)=θpθ(τ)r(τ)dτ=pθ(τ)θlogpθ(τ)r(τ)dτ\begin{split} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)r(\tau)d\tau \\ &= \int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)r(\tau)d\tau \end{split}

Note:

pθ(s1,a1,,sT,aT)=p(s1)t=1Tπθ(atst)p(st+1st,at)logpθ(s1,a1,,sT,aT)=logp(s1)+t=1Tlogπθ(atst)+logp(st+1st,at)p_\theta(s_1,a_1,\dots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t) p(s_{t+1}|s_t, a_t) \\ \log p_\theta(s_1,a_1,\dots,s_T,a_T)=\log p(s_1)+ \sum_{t=1}^T \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)

So if we consider the θJ(θ)\nabla_\theta J(\theta), especially the θlogpθ(τ)\nabla_\theta \log p_\theta(\tau), we see that

θlogpθ(τ)=θ[logp(s1)+t=1Tlogπθ(atst)+logp(st+1st,at)]=θt=1Tlogπθ(atst)\begin{split} \nabla_\theta \log p_\theta(\tau) &= \nabla_\theta[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta(a_t|s_t)+\log p(s_{t+1}|s_t,a_t)] \\ &= \nabla_\theta \sum_{t=1}^T \log \pi_\theta(a_t|s_t) \end{split}

So therefore

θJ(θ)=Eτpθ(τ)[(t=1Tθlogπθ(atst))(t=1Tr(st,at))]1Ni=1N(t=1Tθlogπθ(ai,tsi,t)(t=1Tr(si,t,ai,t))\begin{split} \nabla_\theta J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}[(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\ &\approx \frac{1}{N}\sum_{i=1}^N(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})(\sum_{t=1}^T r(s_{i,t},a_{i,t})) \end{split}

Improve by

θθ+αθJ(θ)\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)

”REINFORCE algorithm” on continuous space

e.g. We can represent pθ(atst)p_\theta(a_t|s_t) as N(fneural network(st);Σ)N(f_{\text{neural network}}(s_t);\Sigma)

θlogπθ(atst)=Σ1(f(st)at)dfdθ\nabla_{\theta} \log \pi_\theta(a_t|s_t)=-\Sigma^{-1}(f(s_t)-a_t)\frac{df}{d\theta}

  1. Sample {τi}\{\tau^i\} from πθ(atst)\pi_\theta(a_t|s_t) (run it on the robot)
  1. θJ(θ)i(tθlogπθ(atisti))(tr(sti,ati))\nabla_\theta J(\theta) \approx \sum_i(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i))(\sum_t r(s_t^i, a_t^i))
  1. θθ+αθJ(θ)\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
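The three steps above can be sketched in NumPy. This is a minimal, illustrative implementation of step 2 only (the gradient estimate); `grad_log_probs` is assumed to hold per-step ∇θ log πθ values already computed for each sampled trajectory, which in practice would come from the policy model.

```python
import numpy as np

def reinforce_gradient(grad_log_probs, rewards):
    """Step 2 of REINFORCE: estimate grad J(theta) as the average over
    sampled trajectories of (sum_t grad log pi(a_t|s_t)) * (sum_t r_t).

    grad_log_probs: list of (T, d) arrays, one per sampled trajectory
    rewards:        list of (T,)   arrays, one per sampled trajectory
    """
    per_traj = [g.sum(axis=0) * r.sum() for g, r in zip(grad_log_probs, rewards)]
    return np.mean(per_traj, axis=0)
```

Step 3 is then just `theta = theta + alpha * reinforce_gradient(...)`.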

Partial Observability

θJ(θ)1Ni=1N(t=1Tθlogπθ(ai,toi,t))(t=1Tr(si,t,ai,t))\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N (\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|o_{i,t}))(\sum_{t=1}^T r(s_{i,t},a_{i,t}))
🔥
Markov property is not actually used so we can just use oi,to_{i,t} in place of si,ts_{i,t}

What’s wrong?

Blue graph corresponds to PDF of reward, green bars represent samples of rewards, and dashed blue graph corresponds to fitted policy distribution
What if reward is shifted up (adding a constant) (while everything maintains the same)?

We see huge differences in fitted policy distributions ⇒ we have a high variance problem

How we can modify policy gradient to reduce variance?

🔥
Causality: policy at time tt’ cannot affect reward at time tt if t<tt < t’
θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(t=1Tr(si,t,ai,t))\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})(\sum_{t'=1}^T r(s_{i,t'},a_{i,t'}))

We can show that the expectation of the terms involving past rewards is zero, so we can rewrite the summation. Removing those terms changes the finite-sample estimator, but it remains unbiased.

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(t=tTr(si,t,ai,t))undefinedreward to go Q^i,t\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\underbrace{(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}))}_{\mathclap{\text{reward to go $\hat{Q}_{i,t}$}}}

We have removed some terms from the sum, so the variance is lower.
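The reward-to-go Q̂ can be computed in one backward pass over a reward sequence — a small NumPy sketch:

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat[t] = sum_{t'>=t} rewards[t'] (the causality trick):
    reverse, cumulative-sum, reverse back."""
    return np.cumsum(rewards[::-1])[::-1]
```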

With discount factors

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(t=tTγttr(si,t,ai,t))\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla _\theta \log \pi_\theta (a_{i,t}|s_{i,t}) (\sum_{t' = t}^T \gamma^{t' -t} r(s_{i,t'},a_{i,t'}))

The exponent ttt'-t on γ\gamma is important because the gradient contribution is NOT decreasing as tt steps forward ⇒ at the current step we put less weight on far-future rewards, but at future steps those rewards will be valued fully again.

If we instead let the power be t1t'-1 or t1t-1 ⇒ then as tt \rightarrow \infin, the gradient contribution of later time steps would shrink toward zero.
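The discounted reward-to-go with the γ^{t'−t} exponent can be computed with a single backward recursion — a small sketch, assuming scalar per-step rewards:

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma):
    """Q_hat[t] = sum_{t'>=t} gamma^(t'-t) * rewards[t'].
    The exponent restarts at each t, so every step values its own
    future fully (unlike a gamma^(t'-1) weighting)."""
    out = np.empty(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```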

Baselines

We want policy gradients to increase the probability of actions whose rewards are above average and decrease the probability of actions whose rewards are below average.

θJ(θ)1Ni=1Nθlogpθ(τ)[r(τ)b]\begin{split} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p_\theta (\tau)[r(\tau)-b] \end{split}

Without baseline (b = 0), but with baseline, set b as

b=1Ni=1Nr(τi)b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)

Changing b keeps the estimator unbiased but changes the variance!

E[θlogpθ(τ)b]=pθ(τ)θlogpθ(τ)bdτ=θpθ(τ)bdτ=bθpθ(τ)dτ=bθ1=0\begin{split} \mathbb{E}[\nabla_\theta \log p_\theta(\tau)b] &= \int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau) b d\tau \\ &= \int \nabla_\theta p_\theta(\tau)b d\tau \\ &= b \nabla_\theta \int p_\theta(\tau) d\tau \\ &=b\nabla_\theta1 = 0 \end{split}

Average reward is not the best baseline, but it’s pretty good!

Let’s first write down variance

Var=Eτpθ(τ)[(θlogpθ(τ)(r(τ)b))2]Eτpθ(τ)[θlogpθ(τ)(r(τ)b)]2\begin{split} Var &= \mathbb{E}_{\tau \sim p_\theta(\tau)}[(\nabla_\theta \log p_\theta(\tau)(r(\tau)-b))^2]-\mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta\log p_\theta (\tau)(r(\tau)-b)]^2 \\ \end{split}
dVardb=ddbE[g(τ)2(r(τ)b)2],g(τ)=gradient(log(τ))=ddb(E[g(τ)2r(τ)2]2E[g(τ)2r(τ)b]+b2E[g(τ)2])=2E[g(τ)2r(τ)]+2bE[g(τ)2]=0\begin{split} \frac{dVar}{db} &=\frac{d}{db}\mathbb{E}[g(\tau)^2(r(\tau)-b)^2], g(\tau)=\text{gradient$(\log(\tau))$} \\ &=\frac{d}{db}(\mathbb{E}[g(\tau)^2r(\tau)^2]-2\mathbb{E}[g(\tau)^2r(\tau)b]+b^2\mathbb{E}[g(\tau)^2]) \\ &= -2\mathbb{E}[g(\tau)^2r(\tau)]+2b\mathbb{E}[g(\tau)^2]=0 \\ \end{split}

So optimal value of b

b=E[g(τ)2r(τ)]E[g(τ)2]b = \frac{\mathbb{E}[g(\tau)^2r(\tau)]}{\mathbb{E}[g(\tau)^2]}

Intuition: there is a different optimal baseline for every entry of the gradient — the baseline for each parameter is the expected reward weighted by the squared magnitude of that parameter's gradient.
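The optimal per-parameter baseline can be estimated from samples — a minimal sketch, assuming we already have per-trajectory gradients g(τ) and returns r(τ):

```python
import numpy as np

def optimal_baseline(grads, returns):
    """Per-parameter optimal baseline b_k = E[g(tau)_k^2 r(tau)] / E[g(tau)_k^2],
    estimated from N sampled trajectories.
    grads: (N, d) per-trajectory gradients; returns: (N,) trajectory returns."""
    g2 = grads ** 2
    return (g2 * returns[:, None]).mean(axis=0) / g2.mean(axis=0)
```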

Off-policy setting

Policy gradient is defined to be on-policy

The expectation requires fresh samples every time we update the policy; the problem is that neural nets only change a little bit with each gradient step.

So we can modify the policy a bit

Instead of having samples from pθ(τ)p_\theta(\tau), we have samples from pˉ(τ)\bar{p}(\tau)

We can use importance sampling to accommodate this case

Exp(x)[f(x)]=Pp(x)f(x)dx=Pq(x)Pq(x)Pp(x)f(x)dx=Pq(x)Pp(x)Pq(x)f(x)dx\begin{split} \mathbb{E}_{x \sim p(x)}[f(x)] &= \int \mathbb{P}_p(x)f(x)dx \\ &=\int \frac{\mathbb{P}_q (x)}{\mathbb{P}_q (x)} \mathbb{P}_p (x) f(x) dx \\ &=\int \mathbb{P}_q(x) \frac{\mathbb{P}_p(x)}{\mathbb{P}_q(x)}f(x)dx \end{split}

Due to this expectation, we see that our objective becomes this

J(θ)=Eτpˉ(τ)[pθ(τ)pˉ(τ)r(τ)]J(\theta)=\mathbb{E}_{\tau \sim \bar{p}(\tau)}[\frac{p_\theta(\tau)}{\bar{p}(\tau)}r(\tau)]

Where

pθ(τ)=p(s1)t=1Tπθ(atst)p(st+1st,at)p_\theta(\tau)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)

Therefore,

pθ(τ)pˉ(τ)=p(s1)t=1Tπθ(atst)p(st+1st,at)p(s1)t=1Tπˉθ(atst)p(st+1st,at)=t=1Tπθ(atst)t=1Tπˉθ(atst)\begin{split} \frac{p_\theta(\tau)}{\bar{p}(\tau)} &= \frac{p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)} \\ &=\frac{\prod_{t=1}^T \pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar\pi_\theta(a_t|s_t)} \end{split}

So… we can estimate the value of some new parameters θ\theta '

J(θ)=Eτpθ(τ)[pθ(τ)pθ(τ)r(τ)]J(\theta')=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} r(\tau)]
θJ(θ)=Eτpθ(τ)[θpθ(τ)pθ(τ)r(τ)]=Eτpθ(τ)[pθ(τ)pθ(τ)θlogpθ(τ)r(τ)]=Eτpθ(τ)[(t=1Tπθ(atst)πθ(atst))(t=1Tθlogπθ(atst))(t=1Tr(st,at))]\begin{split} \nabla_{\theta '} J(\theta ') &= \mathbb{E}_{\tau \sim p_\theta(\tau)} [\frac{\nabla_{\theta '} p_{\theta '}(\tau)}{p_\theta(\tau)}r(\tau)] \\ &=\mathbb{E}_{\tau \sim p_\theta(\tau)} [\frac{p_{\theta '}(\tau)}{p_\theta(\tau)}\nabla_{\theta '} \log p_{\theta '} (\tau) r(\tau)] \\ &=\mathbb{E}_{\tau \sim p_\theta(\tau)}[(\prod_{t=1}^T\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)})(\sum_{t=1}^T \nabla_{\theta '} \log \pi_{\theta '}(a_t|s_t))(\sum_{t=1}^T r(s_t,a_t))] \end{split}

We can consider causality(the fact that current action does not affect past rewards) and

θJ(θ)=Eτpθ(τ)[t=1Tθlogπθ(atst)(t=1tπθ(atst)πθ(atst))(t=tTr(st,at)(t=ttπθ(atst)πθ(atst)))]\begin{split} \nabla_{\theta '} J(\theta ') &=\mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_{t=1}^T \nabla_{\theta '} \log \pi_{\theta '}(a_t|s_t)(\prod_{t'=1}^t \frac{\pi_{\theta '} (a_{t'}|s_{t'})}{\pi_\theta(a_{t '}|s_{t '})})(\sum_{t' = t}^Tr(s_{t'},a_{t'})(\prod_{t''=t}^{t'}\frac{\pi_{\theta '}(a_{t ''}|s_{t ''})}{\pi_\theta(a_{t ''} | s_{t ''})}))] \end{split}

Some important terms in this equation:

t=1tπθ(atst)πθ(atst)\prod_{t'=1}^t \frac{\pi_{\theta '}(a_{t'} | s_{t'})}{\pi_{\theta}(a_{t'} | s_{t'})} ⇒ future actions won't affect the current weight ⇒ however, this product is still troublesome because it is exponential in TT

t=ttπθ(atst)πθ(atst)\prod_{t''=t}^{t'}\frac{\pi_{\theta '}(a_{t ''}|s_{t ''})}{\pi_\theta(a_{t ''} | s_{t ''})} ⇒ we cannot completely ignore this term: stripping it out no longer gives gradient ascent on the RL objective, but instead yields the "policy iteration" algorithm

Since the first product is exponential in TT, we can try to write the objective a bit differently

On-policy policy gradient

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)Q^i,t\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \hat{Q}_{i,t}

Off-policy policy gradient

θJ(θ)1Ni=1Nt=1Tπθ(si,t,ai,t)πθ(si,t,ai,t)θlogπθ(ai,tsi,t)Q^i,t=1Ni=1Nt=1Tπθ(si,t)πθ(si,t)undefinedCan we ignore this part?πθ(ai,tsi,t)πθ(ai,tsi,t)θlogπθ(ai,tsi,t)Q^i,t\begin{split} \nabla_{\theta '} J(\theta ') &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \frac{\pi_{\theta '} (s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t},a_{i,t})} \nabla_{\theta '} \log \pi_{\theta '}(a_{i,t}|s_{i,t}) \hat{Q}_{i,t} \\ &\quad = \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \underbrace{\frac{\pi_{\theta '}(s_{i,t})}{\pi_\theta(s_{i,t})}}_{\mathclap{\text{Can we ignore this part?}}} \frac{\pi_{\theta '}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} \nabla_{\theta '} \log \pi_{\theta '}(a_{i,t}|s_{i,t}) \hat{Q}_{i,t} \end{split}

This does not necessarily give the optimal policy, but it's a reasonable approximation.
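The final off-policy estimator above, keeping only the single-step action ratio and dropping the state-marginal ratio, can be sketched as follows (shapes and names are illustrative):

```python
import numpy as np

def off_policy_pg(logp_new, logp_old, grad_logp_new, q_hat):
    """First-order off-policy PG estimate: each grad-log-pi term is weighted
    only by the per-step ratio pi_new(a_t|s_t)/pi_old(a_t|s_t) and the
    reward-to-go; the state-marginal ratio is ignored.
    logp_new, logp_old, q_hat: (N, T) arrays; grad_logp_new: (N, T, d)."""
    w = np.exp(logp_new - logp_old)                       # per-step ratios
    weighted = w[..., None] * grad_logp_new * q_hat[..., None]
    return weighted.sum(axis=1).mean(axis=0)              # sum over t, average over i
```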

Policy Gradient with Automatic Differentiation

We don’t want to be inefficient at computing gradients ⇒ use backprop trick

So why don’t we start from the MLE approach to Policy Gradients?

θJML1Ni=1Nt=1Tθlogπθ(ai,tsi,t)\nabla_\theta J_{ML} \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})

To start doing backprop, let’s define the objective function first.

JML(θ)1Ni=1Nt=1Tlogπθ(ai,tsi,t)J_{ML}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})

So we will define our “pseudo-loss” approximation as weighted maximum likelihood

J~(θ)1Ni=1Nt=1Tlogπθ(ai,tsi,t)Q^i,t\tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t}) \hat{Q}_{i,t}
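The pseudo-loss is just a weighted log-likelihood, so its value is trivial to compute; feeding it to any autodiff framework (with Q̂ treated as a constant) recovers the policy gradient. A minimal NumPy sketch of the loss value itself:

```python
import numpy as np

def pg_pseudo_loss(logp, q_hat):
    """Weighted maximum-likelihood surrogate (1/N) sum_i sum_t log pi * Q_hat.
    Differentiating this w.r.t. the policy parameters (with q_hat held
    constant, i.e. gradients are not propagated through it) yields the
    policy gradient.  logp, q_hat: (N, T) arrays."""
    return (logp * q_hat).sum(axis=1).mean()
```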

A few reminders about tuning hyper-parameters

  1. Gradients have very high variance
    1. This isn’t the same as supervised learning
  1. Consider using much larger batches
  1. Tweaking learning rates is VERY hard
    1. Adaptive step size rules like ADAM can be OK-ish

Covariant/natural policy gradient

🤔
One nice thing about one-dimensional state and action spaces is that you can easily visualize the policy distribution on a 2D plane
θθ+αθJ(θ)\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

We can view the first order gradient descent as follows

θarg maxθ(θθ)θJ(θ)s.t. θθ2ϵ\theta ' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_{\theta}J(\theta) \\ \text{s.t. } ||\theta ' - \theta||^2 \le \epsilon

Think of ϵ\epsilon as the radius of a high-dimensional sphere.

We can instead put constraint on the distribution of the policy

θarg maxθ(θθ)θJ(θ)s.t. D(πθ,πθ)ϵ\theta ' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_{\theta}J(\theta) \\ \text{s.t. } D(\pi_{\theta '}, \pi_\theta) \le \epsilon

Where D(,)D(\cdot, \cdot) is a divergence measure between distributions.

So we want to choose some parameterization-independent divergence measure

Usually KL-divergence

DKL(πθπθ)=Eπθ[logπθlogπθ]D_{KL}(\pi_{\theta '}||\pi_\theta) = \mathbb{E}_{\pi_{\theta '}}[\log \pi_{\theta '} - \log \pi_\theta]

A bit hard to plug into our gradient descent ⇒ we will approximate this divergence using a 2nd-order Taylor expansion around the current parameter value θ\theta

Advanced Policy Gradients

Why does policy gradient work?

  1. Estimate A^π(st,at)\hat{A}^\pi (s_t,a_t) (Monte-carlo or function approximator) for current policy π\pi
  1. Use A^π(st,at)\hat{A}^\pi(s_t,a_t) to get improved policy π\pi '

We are going back and forth between (1) and (2)

Looks familiar to policy iteration algorithm!

Nice thing about policy gradients:

  1. If the advantage estimator is perfect, we only move a bounded step toward the “optimal” policy for that advantage estimator, rather than jumping all the way to it

But how do we formalize this?

If we set the loss function of policy gradient as

J(θ)=Eτpθ(τ)[tγtr(st,at)]J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_t \gamma^t r(s_t,a_t)]

Then we can calculate how much we improved by

J(θ)J(θ)=Eτpθ(τ)[tγtAπθ(st,at)]J(\theta ') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)]

Here we claimed that J(θ)J(θ)J(\theta') - J(\theta) equals the expectation of the discounted advantages of the old policy, taken over the trajectories that the new policy would explore.

We will prove by:

J(θ)J(θ)=J(θ)Es0p(s0)[Vπθ(s0)]=J(θ)Eτpθ(τ)[Vπθ(s0)]undefinedIt’s the same as above because the initial states s0 are the same=J(θ)Eτpθ(τ)[t=0γtVπθ(st)t=1γtVπθ(st)]=J(θ)+Eτpθ(τ)[t=0γt(γVπθ(st+1)Vπθ(st))]=Eτpθ(τ)[t=0γtr(st,at)]+Eτpθ(τ)[t=0γt(γVπθ(st+1)Vπθ(st))]=Eτpθ(τ)[t=0γt(r(st,at)+γVπθ(st+1)Vπθ(st))]=Eτpθ(τ)[t=0γtAπθ(st,at)]\begin{split} J(\theta')-J(\theta)&=J(\theta')-\mathbb{E}_{s_0 \sim p(s_0)} [V^{\pi_\theta}(s_0)] \\ &= J(\theta') - \underbrace{\mathbb{E}_{\tau \sim p_{\theta '}(\tau)}[V^{\pi_\theta}(s_0)]}_{\mathclap{\text{It's the same as above because the initial states $s_0$ are the same}}} \\ &= J(\theta') - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_{t=0}^\infin \gamma^t V^{\pi_\theta}(s_t) - \sum_{t=1}^\infin \gamma^t V^{\pi_\theta}(s_t)] \\ &= J(\theta') + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_{t=0}^\infin \gamma^t (\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t))] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_{t=0}^\infin \gamma^t r(s_t,a_t)] + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_{t=0}^\infin \gamma^t (\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t))] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_{t=0}^\infin \gamma^t (r(s_t,a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t))] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[\sum_{t=0}^\infin \gamma^t A^{\pi_\theta}(s_t,a_t)] \end{split}
🔥
This proof shows that by optimizing Eτpθ[tγtAπθ(st,at)]\mathbb{E}_{\tau \sim p_{\theta’}}[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)], we are actually improving our policy
But how does this relate to our policy gradient?
Eτ pθ(τ)[tγtAπθ(st,at)]=tEstpθ(st)[Eatπθ(atst)[γtAπθ(st,at)]]=tEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)undefinedimportance samplingγtAπθ(st,at)]]tEstpθ(st)undefinedignoring distribution mismatch[Eatπθ(atst)[πθ(atst)πθ(atst)γtAπθ(st,at)]]=Aˉ(θ)\begin{split} \mathbb{E}_{\tau\ \sim p_{\theta'}(\tau)}[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)] &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} [\mathbb{E}_{a_t \sim \pi_{\theta '}(a_t|s_t)}[\gamma^t A^{\pi_\theta} (s_t,a_t)]] \\ &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} [\underbrace{\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}}_{\mathclap{\text{importance sampling}}} \gamma^t A^{\pi_\theta} (s_t,a_t)]] \\ &\approx \sum_t \underbrace{\mathbb{E}_{s_t \sim p_{\theta}(s_t)}}_{\mathclap{\text{ignoring distribution mismatch}}} [\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)} \gamma^t A^{\pi_\theta} (s_t,a_t)]] \\ &\quad =\bar{A}(\theta') \end{split}

We want this approximation to be true so that

J(θ)J(θ)Aˉ(θ)J(\theta')-J(\theta) \approx \bar{A}(\theta')

And then we can use

θarg maxθAˉ(θ)\theta' \leftarrow \argmax_{\theta'} \bar{A}(\theta')

to optimize our policy


But when is this true?

Bounding the distribution change (deterministic)

We will show:

🔥
pθ(st)p_\theta(s_t) is close to pθ(st)p_{\theta'}(s_t) when πθ\pi_\theta is close to πθ\pi_{\theta'}

We will assume that πθ\pi_\theta is a deterministic policy at=πθ(st)a_t = \pi_\theta(s_t)

We define closeness of policy as:

πθ\pi_{\theta’} is close to πθ\pi_\theta if πθ(atπθ(st)st)ϵ\pi_{\theta’}(a_t \ne \pi_\theta(s_t)|s_t) \le \epsilon
pθ(st)=(1ϵ)tundefinedprobability we made no mistakespθ(st)+(1(1ϵ)t)pmistake(st)undefinedsome other distributionp_{\theta'}(s_t) = \underbrace{(1-\epsilon)^t}_{\mathclap{\text{probability we made no mistakes}}} p_\theta(s_t)+(1-(1-\epsilon)^t)\underbrace{p_{mistake}(s_t)}_{\mathclap{\text{some other distribution}}}

Implies that we can write the total variation divergence as

pθ(st)pθ(st)=(1(1ϵ)t)pmistake(st)pθ(st)2(1(1ϵ)t)2ϵt\begin{split} |p_{\theta'}(s_t)-p_\theta(s_t)| &= (1-(1-\epsilon)^t) |p_{mistake}(s_t)-p_\theta(s_t)| \\ &\le 2(1-(1-\epsilon)^t) \\ &\le 2\epsilon t \end{split}
👉
Useful Identity ϵ[0,1],(1ϵ)t1ϵt\forall \epsilon \in [0,1], (1-\epsilon)^t \ge 1 - \epsilon t
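A quick numerical sanity check of this identity (illustrative, not from the lecture) — it is what lets us bound 2(1−(1−ε)^t) by 2εt:

```python
import numpy as np

# For eps in [0, 1] and integer t >= 0, check (1 - eps)^t >= 1 - eps * t.
eps = np.linspace(0.0, 1.0, 101)
for t in range(20):
    # small tolerance guards against floating-point rounding at equality
    assert np.all((1 - eps) ** t >= 1 - eps * t - 1e-12)
```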

Bounding the distribution change (arbitrary)

Schulman, Levine, Moritz, Jordan, Abbeel, “Trust Region Policy Optimization”

We will show:

🔥
pθ(st)p_\theta(s_t) is close to pθ(st)p_{\theta'}(s_t) when πθ\pi_\theta is close to πθ\pi_{\theta'}

Closeness:

πθ\pi_{\theta’} is close to πθ\pi_\theta if πθ(atst)πθ(atst)ϵ|\pi_{\theta’}(a_t|s_t) - \pi_\theta(a_t|s_t)| \le \epsilon for all sts_t

Useful Lemma:

If pX(x)pY(y)=ϵ|p_X(x) - p_Y(y)| = \epsilon, there exists p(x,y)p(x,y) such that p(x)=pX(x)p(x) = p_X(x) and p(y)=pY(y)p(y) = p_Y(y) and p(x=y)=1ϵp(x=y) = 1 - \epsilon

Says:

  1. pX(x)p_X(x) agrees with pY(y)p_Y(y) with probability 1ϵ1-\epsilon
  1. πθ(atst)\pi_{\theta'}(a_t|s_t) takes a different action than πθ(atst)\pi_{\theta}(a_t|s_t) with probability at most ϵ\epsilon

Therefore

pθ(st)pθ(st)=(1(1ϵ)t)pmistake(st)pθ(st)2(1(1ϵ)t)2ϵt\begin{split} |p_{\theta'}(s_t)-p_{\theta}(s_t)| &= (1-(1-\epsilon)^t)|p_{mistake}(s_t)-p_\theta(s_t)| \\ &\le 2(1-(1-\epsilon)^t) \\ &\le 2 \epsilon t \end{split}

Bounding the objective value

Assume pθ(st)pθ(st)2ϵt|p_{\theta’}(s_t)-p_{\theta}(s_t)| \le 2\epsilon t

Estpθ(st)[f(st)]=stpθ(st)f(st)stpθ(st)f(st)pθ(st)pθ(st)maxstf(st)Epθ(st)[f(st)]2ϵtmaxstf(st)\begin{split} \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}[f(s_t)] &= \sum_{s_t} p_{\theta'}(s_t)f(s_t) \\ &\ge \sum_{s_t} p_{\theta}(s_t)f(s_t) - |p_{\theta}(s_t)-p_{\theta'}(s_t)|\max_{s_t}f(s_t) \\ &\ge \mathbb{E}_{p_\theta(s_t)}[f(s_t)] - 2\epsilon t \max_{s_t} f(s_t) \end{split}

For the objective value, we can bound it by

Eτ pθ(τ)[tγtAπθ(st,at)]=tEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)undefinedimportance samplingγtAπθ(st,at)]]tEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)γtAπθ(st,at)]]t2ϵtC\begin{split} \mathbb{E}_{\tau\ \sim p_{\theta'}(\tau)}[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)] &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} [\underbrace{\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}}_{\mathclap{\text{importance sampling}}} \gamma^t A^{\pi_\theta} (s_t,a_t)]] \\ & \ge \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)} [\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)} \gamma^t A^{\pi_\theta} (s_t,a_t)]] - \sum_{t} 2\epsilon tC \\ \end{split}

Where CC is a constant: the largest value the quantity inside the state expectation can take ⇒ it's a probability times an advantage ⇒ at most the largest possible reward times the number of time steps.

So:

CO(Trmax) or O(rmax1γ)undefinedinfinite time with discount (geometric series)C \in O(Tr_{max}) \text{ or } \underbrace{O(\frac{r_{max}}{1-\gamma})}_{\mathclap{\text{infinite time with discount (geometric series)}}}

Therefore for small enough ϵ\epsilon, Aˉπ(θ)\bar{A}^{\pi}(\theta ') is guaranteed to be similar to the true RL objective

Policy Gradients with Constraints

We want to constrain:

πθ(atst)πθ(atst)ϵ|\pi_{\theta'}(a_t|s_t) - \pi_{\theta}(a_t|s_t)| \le \epsilon

which will result in

pθ(st)pθ(st)2ϵt|p_{\theta'}(s_t)-p_\theta(s_t)| \le 2\epsilon t

A more convenient bound:

πθ(atst)πθ(atst)12DKL(πθ(atst),πθ(atst))|\pi_{\theta'}(a_t|s_t) - \pi_{\theta}(a_t|s_t)| \le \sqrt{\frac{1}{2} D_{KL}(\pi_{\theta'}(a_t|s_t),\pi_{\theta}(a_t|s_t))}

We see that the KL divergence bounds the state marginal difference

Note:

DKL(p1(x),p2(x))=Exp1(x)[logp1(x)p2(x)]D_{KL}(p_1(x),p_2(x)) = \mathbb{E}_{x \sim p_1(x)} [\log \frac{p_1(x)}{p_2(x)}]
👉
KL Divergence has very convenient properties that make it much easier to approximate
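For example, for the Gaussian policies common in continuous control, the KL divergence between two policies at a state has a simple closed form. A sketch for the 1-D case (the multivariate version is analogous):

```python
import numpy as np

def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ).
    Lets the KL constraint be evaluated exactly per state for
    Gaussian policies, instead of by sampling."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)
```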

So, constraining our objective, we have:

θarg maxθtEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)γtAπθ(st,at)]]s.t. DKL(πθ(atst),πθ(atst))ϵ\theta' \leftarrow \argmax_{\theta'} \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)]] \\ \text{s.t. } D_{KL}(\pi_{\theta '}(a_t|s_t), \pi_\theta(a_t|s_t)) \le \epsilon

For small ϵ\epsilon, this is guaranteed to improve J(θ)J(θ)J(\theta’) - J(\theta)

👉
We can do this by simply writing out the Lagrangian and doing convex optimization
L(θ,λ)=tEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)γtAπθ(st,at)]]λ(DKL(πθ(atst),πθ(atst))ϵ)L(\theta',\lambda) = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)]] - \lambda(D_{KL}(\pi_{\theta '}(a_t|s_t), \pi_\theta(a_t|s_t)) - \epsilon)

“Dual Gradient Descent” Alternate between:

  1. Maximize L(θ,λ)L(\theta’, \lambda) with respect to θ\theta’
  1. λλ+α(DKL(πθ(atst),πθ(atst))ϵ)\lambda \leftarrow \lambda + \alpha(D_{KL}(\pi_{\theta’}(a_t|s_t),\pi_\theta(a_t|s_t)) - \epsilon)

Intuition: raise λ\lambda if constraint is violated too much
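The alternation can be seen on a toy constrained problem. This is an illustrative sketch (the problem and names are made up, not from the lecture): maximize f(x) = −(x−2)² subject to x² ≤ 1, where the inner maximization happens to have a closed form:

```python
import numpy as np

def dual_gradient_descent(steps=500, alpha=0.05):
    """Toy dual gradient descent on L(x, lam) = -(x-2)^2 - lam*(x^2 - 1).
    Alternates (1) maximizing L over x (closed form: x* = 2/(1+lam)) and
    (2) a gradient step on lam that raises it when the constraint is violated."""
    lam, x = 0.0, 0.0
    for _ in range(steps):
        x = 2.0 / (1.0 + lam)                      # argmax_x L(x, lam)
        lam = max(0.0, lam + alpha * (x**2 - 1.0))  # dual ascent on lam
    return x, lam
```

The iterates converge to the constrained optimum x = 1 with multiplier λ = 1.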

Natural Gradient

As usual, we denote Aˉ(θ)\bar{A}(\theta’) as

Aˉ(θ)=tEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)γtAπθ(st,at)]]\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)]]
When we do gradient descent, we effectively optimize the first-order Taylor expansion of the objective within a small region (trust region) around the current parameters.

Can we use this fact to simplify the optimization step and use one easy algorithm to take care of this constrained optimization problem?

θarg maxθθAˉ(θ)(θθ)s.t. DKL(πθ(atst),πθ(atst))ϵ\theta' \leftarrow \argmax_{\theta'} \nabla_{\theta} \bar{A}(\theta)^{\top}(\theta'-\theta) \\ \text{s.t. } D_{KL}(\pi_{\theta'}(a_t|s_t),\pi_{\theta}(a_t|s_t)) \le \epsilon

Let’s look at the gradient of Aˉ(θ)\bar{A}(\theta’)

θAˉ(θ)=tEstpθ(st)[Eatπθ(atst)[πθ(atst)πθ(atst)γtθlogπθ(atst)Aπθ(st,at)]]\nabla_{\theta'}\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t) A^{\pi_\theta}(s_t,a_t)]] \\

Evaluating this at point θ=θ\theta’ = \theta

θAˉ(θ)=tEstpθ(st)[Eatπθ(atst)[γtθlogπθ(atst)Aπθ(st,at)]]=θJ(θ)\begin{split} \nabla_{\theta}\bar{A}(\theta) &= \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[\gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t,a_t)]] \\ &= \nabla_\theta J(\theta) \end{split}

Can we just use the gradient then?

Remember in the original policy gradient algorithm, θθ+αθJ(θ)\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

But some parameters change probabilities a lot more than others; plain gradient descent does not account for this, since it ignores the KL divergence

⚠️
so we want to either reparameterize the policy so that all parameters have equal effect, or change the learning rate per parameter

But it seems like the problem can be fixed, notice that

Gradient Descent still poses some kind of constraint θθ2ϵ||\theta - \theta’||^2 \le \epsilon

Think about it in parameter space: we are constrained to within some fixed distance (inside a circle) of the original parameters. ⇒ Indeed the learning rate α=ϵθJ(θ)2\alpha = \sqrt{\frac{\epsilon}{||\nabla_\theta J(\theta)||^2}} satisfies this constraint.

But in KL divergence, maybe we want an ellipse? (we want a circle in distribution space)

We will use second order taylor expansion to approximate the KL divergence to make it easier to calculate.

DKL(πθπθ)(θθ)FundefinedFisher-information Matrix(θθ)D_{KL}(\pi_{\theta '} || \pi_\theta) \approx (\theta '-\theta)^{\top}\underbrace{F}_{\mathclap{\text{Fisher-information Matrix}}}(\theta ' - \theta)
F=Eπθ[θlogπθ(as)θlogπθ(as)]F = \mathbb{E}_{\pi_{\theta}}[\nabla_\theta \log \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)^{\top}]

Now we see that the shape of the new ellipse in parameter space is determined by the matrix FF

We can estimate this FF expectation using samples

So therefore!

Using the approximation, our policy becomes

θarg maxθ(θθ)θJ(θ)s.t. θθF2ϵ\theta' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_\theta J(\theta) \\ \text{s.t. } ||\theta ' - \theta||_F^2 \le \epsilon

With this we have the “natural gradient”

θθ+αF1θJ(θ)\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)

If we want to enforce the constraint, we can choose a step size that enforces the constraint.

We can calculate step size α\alpha by

α=2ϵθJ(θ)F1θJ(θ)\alpha = \sqrt{\frac{2 \epsilon}{\nabla_\theta J(\theta)^{\top} F^{-1} \nabla_\theta J(\theta)}}

Or run the conjugate gradient method to compute F1θJ(θ)F^{-1} \nabla_\theta J(\theta) without explicitly inverting FF, and get α\alpha as a by-product (see the trust region policy optimization paper)
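A single natural-gradient update can be sketched as follows — a minimal version that solves the linear system directly (a real implementation would use conjugate gradient and a sample-based F, as noted above), assuming F is positive definite:

```python
import numpy as np

def natural_gradient_step(theta, grad, F, eps):
    """One natural-gradient update theta + alpha * F^{-1} grad, with the
    step size chosen so the quadratic KL approximation equals eps."""
    nat_grad = np.linalg.solve(F, grad)            # F^{-1} grad J, no explicit inverse
    alpha = np.sqrt(2 * eps / (grad @ nat_grad))   # grad^T F^{-1} grad in denominator
    return theta + alpha * nat_grad
```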

Several methods utilize this trick:

  1. natural gradient: pick α\alpha
  1. trust region policy optimization: pick ϵ\epsilon
  1. Can solve for optimal α\alpha while solving F1θJ(θ)F^{-1} \nabla_\theta J(\theta)
  1. Conjugate gradient works well for this

Recommended Readings

  1. Natural Policy Gradient
    1. Schulman, L., Moritz, Jordan, Abbeel (2015) Trust region policy optimization
    1. Peters, Schaal. Reinforcement Learning of motor skills with policy gradients
  1. IS objective directly
    1. Proximal Policy Optimization (PPO)

Is this even a problem in practice?

👉
It’s a pretty big problem especially with continuous action spaces. Generally a good choice to stabilize policy gradient training

With this problem, vanilla policy gradients do point in the right direction, but they are extremely ill-conditioned (too busy decreasing σ\sigma).

Actor-Critic Method

Remember from the policy gradient that

θJ(θ)=Eτpθ(τ)[(t=1Tθlogπθ(atst))(t=1Tr(st,at))]1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(t=tTr(si,t,ai,t))undefinedReward to go Q^i,t\begin{split} \nabla_\theta J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}[(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_t|s_t))(\sum_{t=1}^Tr(s_t,a_t))] \\ &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\underbrace{(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}))}_{\mathclap{\text{Reward to go $\hat{Q}_{i,t}$}}} \end{split}

Q^i,t\hat{Q}_{i,t} is the estimate of expected reward if we take action ai,ta_{i,t} in state si,ts_{i,t}

But Q^i,t\hat{Q}_{i,t} currently has a very high variance

  1. Q^i,t\hat{Q}_{i,t} only accounts for a single sampled chain of state-action pairs
    1. This is because we approximated the gradient by stripping away the expectation

We can get a better estimate by using the true expected reward-to-go

Qπ(st,at)=t=tTEπθ[r(st,at)st,at]Q^{\pi}(s_t,a_t)=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t '},a_{t'})|s_t,a_t]
What about baseline? Can we apply a baseline even if we have a true Q function?
Vπ(st)=Eatπθ(atst)[Qπ(st,at)]V^{\pi}(s_t)=\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[Q^{\pi}(s_t,a_t)]

Turns out that we can do better (less variance) than applying bt=1NiQ(si,t,ai,t)b_t = \frac{1}{N} \sum_i Q(s_{i,t},a_{i,t}), because with the value function the baseline can depend on the state.

So now our gradient becomes

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(Qπ(si,t,ai,t)Vπ(si,t))undefinedHow much better the action ai,t is than the average action1Ni=1Nt=1Tθlogπθ(ai,tsi,t)Aπ(si,t,ai,t)\begin{split} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \underbrace{(Q^{\pi}(s_{i,t},a_{i,t})-V^{\pi}(s_{i,t}))}_{\mathclap{\text{How much better the action $a_{i,t}$ is than the average action}}} \\ &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) A^{\pi}(s_{i,t},a_{i,t}) \end{split}

We will also name something new: the advantage function:

How much better the action ai,ta_{i,t} is than the average action

Aπ(st,at)=Qπ(st,at)Vπ(st)A^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)-V^{\pi}(s_t)

The better the estimate of AπA^{\pi} is, the lower the variance of θJ(θ)\nabla_\theta J(\theta).

Fitting Q/Value/Advantage Functions

But problem is, what should we fit? Qπ,Vπ,AπQ^{\pi}, V^{\pi}, A^{\pi}?

Let’s do some approximation and find out

Qπ(st,at)=r(st,at)+γEst+1p(st+1st,at)[Vπ(st+1)]r(st,at)+γVπ(st+1)\begin{split} Q^{\pi}(s_t,a_t)&=r(s_t,a_t)+\gamma\mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t,a_t)}[V^{\pi}(s_{t+1})] \\ &\approx r(s_t,a_t)+\gamma V^{\pi}(s_{t+1}) \end{split}
Aπr(st,at)+γVπ(st+1)Vπ(st)A^{\pi} \approx r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)

We will introduce what γ\gamma(discount factor) is later. Just view it as γ=1\gamma=1 for now.

So now we see that it is now easy to fit VπV^{\pi} and use VπV^{\pi} to approximate QπQ^{\pi} and AπA^{\pi}.

VπV^{\pi} is relatively easier to fit to because it does not involve action, only depends on state.

🔥
Note that Actor-Critic can also fit QπQ^{\pi}.

Policy Evaluation

This is what policy gradient does
Vπ(st)t=tTr(st,at)V^{\pi}(s_t) \approx \sum_{t' = t}^T r(s_{t '}, a_{t' })

Ideally we want to do this:

Vπ(st)1Ni=1Nt=tTr(st,at)V^{\pi}(s_t) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t'=t}^T r(s_{t'},a_{t'})

But in a model-free setting we cannot reset back to a state and run multiple trials, so we settle for the single-sample estimate

Expectation of rewards

So…

Monte Carlo policy evaluation
Use empirical returns to train a value function to approximate the expectation
We can use a neural net

Instead of plugging those rewards directly into the policy gradient, we fit a model to them ⇒ this reduces variance

Because even though we cannot visit the same state twice, the function approximator will combine information from similar states

And we can of course use MSE, etc. - supervised training losses

Ideal target

yi,t=t=tTEπθ[r(st,at)si,t]r(si,t,ai,t)+t=t+1TEπθ[r(st,at)si,t+1]y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t},a_{i,t})+\sum_{t' = t+1}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t+1}]

Monte Carlo target:

yi,t=t=tTr(si,t,ai,t)y_{i,t} = \sum_{t' = t}^T r(s_{i,t'},a_{i,t'})

Training data would be:

{(si,t,t=tTr(si,t,ai,t)undefinedyi,t)}\{(s_{i,t},\underbrace{\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})}_{\mathclap{\text{$y_{i,t}$}}})\}
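The Monte Carlo targets above can be computed with a single backward pass over each sampled trajectory. A minimal sketch in plain Python (the discount factor γ\gamma is only introduced later in these notes, so gamma=1.0 reproduces the undiscounted target):

```python
def reward_to_go(rewards, gamma=1.0):
    """Monte Carlo targets y_t = sum_{t'>=t} gamma^(t'-t) * r_t' for one trajectory,
    computed in a single backward pass instead of summing from scratch at each t."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets
```

Each pair (s_{i,t}, targets[t]) then becomes one supervised training example for the value function.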

We can do even better (bootstrapped estimate):

Hmmm… looks like we can modify a bit in the ideal target

yi,t=t=tTEπθ[r(st,at)si,t]r(si,t,ai,t)+Vπ(si,t+1)y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t}, a_{i,t}) + V^{\pi}(s_{i,t+1})

Since we don’t know VπV^{\pi}, we can approximate it by V^ϕπ(si,t+1)\hat{V}_{\phi}^{\pi}(s_{i,t+1}) - our previous value function approximator (bootstrapped estimate)

yi,t=t=tTEπθ[r(st,at)si,t]r(si,t,ai,t)+V^ϕπ(si,t+1)y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})

So now training data:

\{(s_{i,t},\,r(s_{i,t},a_{i,t})+\hat{V}^{\pi}_{\phi} (s_{i,t+1}))\}
🔥
Lower variance than Monte Carlo evaluation, but higher bias (because V^ϕπ\hat{V}_{\phi}^{\pi} might be incorrect)
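The bootstrapped targets above need no backward pass, only the current value estimates at the next states. A minimal sketch, assuming next_values[t] holds the approximator's value at the next state (0.0 at terminal states):

```python
def bootstrapped_targets(rewards, next_values, gamma=1.0):
    """Bootstrapped targets y_t = r(s_t, a_t) + gamma * V_hat(s_{t+1}).
    next_values[t] is the current approximator's prediction at s_{t+1}
    (0.0 for terminal states)."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]
```

Note the target reuses the approximator being trained, which is exactly where the bias in the callout above comes from.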

Batch actor-critic algorithm

⚠️
The fitted value function is not guaranteed to converge, for the same reason as in the section “Value Function Learning Theory”

Discount Factor

If TT \rightarrow \infin, V^ϕπ\hat{V}_\phi^\pi ⇒ the approximator for VπV^{\pi} can get infinitely large in many cases, so we regularize by preferring reward “sooner rather than later”.

yi,tr(si,t,ai,t)+γV^ϕπ(si,t+1)y_{i,t} \approx r(s_{i,t},a_{i,t}) + \gamma \hat{V}_\phi^\pi(s_{i,t+1})

Where γ[0,1]\gamma \in [0,1] is the discount factor; it makes the reward decay at every timestep ⇒ so the obtainable reward over an infinite lifetime is actually bounded.

One understanding that γ\gamma affects policy is that γ\gamma adds a “death state” that once you’re in, you can never get out.

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(t=tTγttr(si,t,ai,t))\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})(\sum_{t' = t}^T \gamma^{t' - t} r(s_{i,t'},a_{i,t'}))

Online Actor-critic Algorithm

In practice:

Off-policy Actor-critic algorithms

Idea: Collect data and instead of training on them directly, put them into a replay buffer. At training instead of using the data just collected, fetch one randomly from the replay buffer.

Coming from the online algorithm, let’s see what problems do we need to fix:

(1) Under the current policy, it is possible that our policy would not even have taken the next action aia_i, and therefore it’s not cool to assume we receive the reward r(si,ai)r(s_i,a_i) ⇒ we may not even arrive at state sis_i’

(2) Same reason, we may not have taken aia_i as our action under the current policy


We can fix the problem in (1) by using Qπ(st,at)Q^{\pi}(s_t,a_t) ⇒ replace the term γV^ϕπ(si)\gamma \hat{V}_\phi^\pi(s_i’)

Now

L(\phi)=\frac{1}{N}\sum_i ||\hat{Q}_\phi^\pi(s_i,a_i)-y_i||^2

And we replace the target value

yi=ri+γQ^ϕπ(si,aiπundefinedthis is sampled from current policy, aiπθ(aiπsi))y_i = r_i + \gamma \hat{Q}^\pi_\phi (s_i',\underbrace{a_i'^\pi}_{\mathclap{\text{this is sampled from current policy, $a_i'\sim \pi_\theta(a_i'^\pi|s_i')$}}})

Same for (2), sample an action from current policy aiππθ(aisi)a_i^\pi \sim \pi_\theta(a_i|s_i) rather than using the original data

And instead of plugging in advantage function,

θJ(θ)1Niθlogπθ(aiπsi)A^π(si,aiπ)1Niθlogπθ(aiπsi)Q^π(si,aiπ)undefinedHigher variance\begin{split} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i^\pi|s_i) \hat{A}^{\pi}(s_i,a_i^\pi) \\ &\approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i^\pi|s_i) \underbrace{\hat{Q}^{\pi}(s_i,a_i^\pi)}_{\mathclap{\text{Higher variance}}} \\ \end{split}

It’s fine to have high variance (not being baselined), because it’s easier and now we don't need to generate more states ⇒ we can just generate more actions

In exchange use a larger batch size ⇒ all good!

🔥
Still one problem left ⇒ sis_i did not come from pθ(s)p_\theta(s) ⇒ but nothing we can do about it. Not too bad ⇒ initially we wanted the optimal policy on pθ(s)p_\theta(s), but we actually get the optimal policy on a broader distribution

Now our final result:

Example Practical Algorithm: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.

Critics as state-dependent baselines

In actor-critic

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(Qπ(si,t,ai,t)Vπ(si,t))1Ni=1Nt=1Tθlogπθ(ai,tsi,t)Aπ(si,t,ai,t)\begin{split} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) (Q^{\pi}(s_{i,t},a_{i,t})-V^{\pi}(s_{i,t})) \\ &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) A^{\pi}(s_{i,t},a_{i,t}) \end{split}

This method of using fitted model to approximate value/Q/Advantage Function:

  1. Lowers variance
  1. Biased as long as the critic is not perfect

In policy gradient:

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)((t=tTγttr(si,t,ai,t))b)\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})((\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'}))-b)

This method:

  1. No bias
  1. High variance

So can we use a state-dependent baseline to keep the estimator unbiased while still reducing variance a bit?

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)((t=tTγttr(si,t,ai,t))V^ϕπ(si,t))\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) ((\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'}))-\hat{V}_\phi^\pi(s_{i,t}))
Not only does the policy gradient remain unbiased when you subtract any constant b, it still remains unbiased when you subtract any function that only depends on the state sis_{i} (and not on the action)

Control variates - methods that use state and action dependent baselines

A^π(st)=t=tγttr(st,at)Vϕπ(st)\hat{A}^{\pi}(s_t) = \sum_{t'=t}^\infin \gamma^{t'-t}r(s_{t'},a_{t'})-V_\phi^\pi(s_t)
  1. No bias
  1. Higher variance (because single-sample estimate)

A^π(st,at)=t=tγttr(st,at)Qϕπ(st,at)\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infin \gamma^{t'-t}r(s_{t'},a_{t'})-Q_\phi^\pi(s_t,a_t)
  1. Goes to zero in expectation if critic is correct
  1. If critic is not correct, bomb shakalaka
    1. The expectation integrates to an error term that needs to be compensated for

To account for the error in baseline, we modify it to:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})(\hat{Q}_{i,t}-Q_\phi^\pi(s_{i,t},a_{i,t})) + \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \mathbb{E}_{a_t \sim \pi_\theta(a_t|s_{i,t})}[Q^{\pi}_\phi(s_{i,t},a_t)]
Use a critic without the bias, provided the second term can be evaluated — Gu et al. 2016 (Q-Prop)

Eligibility traces & n-step returns

Again, the Critic advantage estimator

A^Cπ(st,at)=r(st,at)+γV^ϕπ(st+1)V^ϕπ(st)\hat{A}_C^\pi(s_t,a_t)=r(s_t,a_t)+\gamma \hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)
  1. Lower variance
  1. higher bias if value estimate is wrong

The Monte Carlo Advantage Estimator

A^MCπ(st,at)=t=tγttr(st,at)V^ϕπ(st)\hat{A}_{MC}^\pi(s_t,a_t) = \sum_{t'=t}^\infin \gamma^{t'-t} r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_t)
  1. unbiased
  1. higher variance (single-sample estimate)

Get something in-between?

Facts:

  1. rewards are smaller as tt’ \rightarrow \infin
    1. So biases are much smaller problems when tt’ is big
  1. Variance is more of a problem in the future

A^nπ(st,at)=t=tt+nγttr(st,at)+γnV^ϕπ(st+n)V^ϕπ(st)\hat{A}_n^\pi(s_t,a_t)=\sum_{t'=t}^{t+n} \gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^{n}\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_t)

Generalized advantage estimation

The n-step advantage estimator is good, but can we generalize (hybridize) it?

Instead of hard cutting between two estimators, why don’t we combine them everywhere?

A^GAEπ(st,at)=n=1wnA^nπ(st,at)\hat{A}_{GAE}^{\pi}(s_t,a_t) = \sum_{n=1}^\infin w_n \hat{A}_n^\pi(s_t,a_t)

Where A^nπ\hat{A}_n^\pi stands for the n-step estimator

“Most prefer cutting earlier (less variance)”

So we set wnλn1w_n \propto \lambda^{n-1} ⇒ exponential falloff

Then

A^GAEπ(st,at)=t=t(γλ)ttδt, where δt=r(st,at)+γV^ϕπ(st+1)V^ϕπ(st)\begin{split} \hat{A}_{GAE}^\pi(s_t,a_t)=\sum_{t'=t}^\infin (\gamma \lambda)^{t'-t} \delta_{t'} \\ \text{, where } \delta_{t '}=r(s_{t '},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'}) \end{split}
🔥
Now we can just tradeoff the bias and variances using the parameter λ\lambda
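The GAE recursion above is cheap to implement: compute the one-step TD errors δt\delta_{t'} and accumulate them backward with weight γλ\gamma\lambda. A minimal sketch over one finite trajectory (λ=0 recovers the one-step critic advantage, λ=1 the Monte Carlo advantage):

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation over one finite trajectory.
    values has length len(rewards) + 1: V_hat at every state, with 0.0 after terminal."""
    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t]
              for t in range(len(rewards))]
    # Backward accumulation: A_t = delta_t + gamma * lam * A_{t+1}
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```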

Examples of actor-critic algorithms

Schulman, Moritz, Levine, Jordan, Abbeel ‘16. High dimensional continuous control with generalized advantage estimation
  1. Batch-mode actor-critic
  1. Blends Monte Carlo and GAE

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16. Asynchronous methods for deep reinforcement learning
  1. Online actor-critic, parallelized batch
  1. CNN end-to-end
  1. N-step returns with N=4
  1. Single network for actor and critic

Suggested Readings

Classic: Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation

Talks about contents in this class

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy-gradient with an off-policy critic: policy gradient with Q-function control variate
https://arxiv.org/pdf/2108.08812.pdf ← Why actor-critic works better in offline training

Value Function / Q Function Fitting (Value-Based)

Can we omit policy gradient completely from actor-critic methods?

Actor-critic minimizes

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)(t=tTA(st,at))\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) (\sum_{t'=t}^T A(s_{t'},a_{t'}))

But the best policy can just be extracted from arg maxatAπ(st,at)\argmax_{a_t} A^{\pi}(s_t,a_t)

π(atst)={1if at=arg maxatAπ(st,at)0otherwise\pi'(a_t|s_t)=\begin{cases} 1 &\quad \text{if } a_t = \argmax_{a_t} A^{\pi}(s_t,a_t) \\ 0 &\quad \text{otherwise} \end{cases}

We know that π\pi ' is as good as π\pi, and probably better.

Same as before,

Aπ(s,a)=r(s,a)+γE[Vπ(s)]Vπ(s)A^\pi(s,a)=r(s,a)+\gamma \mathbb{E}[V^{\pi}(s')] - V^{\pi}(s)

Policy Iteration (Fitted Value Iteration)

So fitting the value function can be done very easily with a least-squares estimate if we know the transition dynamics… but what if we don’t know the transition dynamics?

Policy Iteration without transition probability (Fitted Q-Iteration)

What if we change the policy evaluation part to

Qπ(s,a)r(s,a)+γEsp(ss,a)[Qπ(s,π(s))]Q^\pi(s,a) \leftarrow r(s,a)+\gamma \mathbb{E}_{s' \sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]

Now ⇒ if we have an (s,a,r,s)(s,a,r,s’) tuple, then even if policy π\pi ' changes its action in state ss we can still do policy evaluation on the tuple, because the only place the policy shows up is inside the expectation.

This simple change allows us to run policy-iteration-style algorithms without actually knowing the transition dynamics

Instead of using E[V(si)]\mathbb{E}[V(s_i')], we use maxaQϕ(si,a)\max_{a’} Q_\phi(s_i’,a’): the value function V(si)V(s_i') is approximated by maxaQϕ(si,a)\max_{a’} Q_\phi(s_i',a’), and instead of taking expectations over future possible states, we use the next state sis_i’ from the sample

Special Cases of Fitted Q-Learning

During exploration phase, using the greedy algorithm may not be the optimal choice (we want to maximize entropy!)

“Epsilon Greedy”

\pi(a_t|s_t)=\begin{cases} 1-\epsilon &\quad \text{if } a_t = \argmax_{a_t} Q_\phi(s_t,a_t) \\ \epsilon /(|A|-1) &\quad \text{otherwise} \end{cases}

“epsilon chance of exploring other options other than optimal action”

Common practice is to vary epsilon while training (early on, when Q is bad, use a larger epsilon)
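The epsilon-greedy rule above is straightforward to implement for a discrete action space. A minimal sketch (ties in the argmax are broken toward the first index; that tie-breaking choice is an assumption, not from the notes):

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """pi(a|s): probability 1 - eps on argmax_a Q(s,a), eps/(|A|-1) on every other action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    n = len(q_values)
    return [1.0 - epsilon if a == best else epsilon / (n - 1) for a in range(n)]

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample one action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs)[0]
```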

“Exponential”

π(atst)exp(Qϕ(st,at))\pi(a_t|s_t) \propto \exp(Q_\phi(s_t,a_t))
The best action will be most frequent, but this leaves some probability for exploration. Plus, if there is a second-largest Q value, that action is considered more often than the other options.
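The exponential rule above is a softmax over Q values. A minimal sketch (the temperature knob is an extra assumption for illustration; the rule in the notes corresponds to temperature 1):

```python
import math

def boltzmann_probs(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s,a) / temperature)."""
    m = max(q_values)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```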

Value Function Learning Theory

For tabular value iteration algorithm:

The Bellman Operator captures the logic of updating V(s)V(s)

And one fact is that:

VV^*, the value function for the optimal policy, is a fixed point of BB

VV^* always exists, is always unique, and always corresponds to the optimal policy

Does fixed point iteration converge to VV^*?

We can prove that the value iteration reaches VV^* because BB is a contraction

Contraction means: for any VV and Vˉ\bar{V}, we have BVBVˉγVVˉ||BV-B\bar{V}||_\infin \le \gamma||V-\bar{V}||_\infin where γ(0,1)\gamma \in (0,1)

Meaning that VV and Vˉ\bar{V} will get closer and closer as we apply BB to them

If we replace Vˉ\bar{V} by VV^* (and since BV=VBV^* = V^*),

BVVγVV||BV-V^*||_\infin \le \gamma ||V-V^*||_\infin
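The contraction property can be checked numerically on a tabular MDP. A minimal sketch with a made-up 4-state, 2-action MDP (uniform transitions, random rewards — all illustrative choices, not from the notes):

```python
import random

# A tiny random tabular MDP, invented for illustration.
S, A, gamma = 4, 2, 0.9
rng = random.Random(0)
R = [[rng.random() for _ in range(A)] for _ in range(S)]
P = [[[1.0 / S] * S for _ in range(A)] for _ in range(S)]  # p(s'|s,a), uniform

def bellman(V):
    """(BV)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) * V(s') ]"""
    return [max(R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(S))
                for a in range(A)) for s in range(S)]

def dist(X, Y):
    """Infinity norm ||X - Y||_inf."""
    return max(abs(x - y) for x, y in zip(X, Y))

# Two arbitrary value functions: applying B shrinks their distance by at least gamma,
# so repeated application drives any V toward the unique fixed point V*.
V1 = [rng.uniform(-5, 5) for _ in range(S)]
V2 = [rng.uniform(-5, 5) for _ in range(S)]
```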

For fitted value iteration algorithm:

Step 1 basically applies the bellman operator

We will define a new operator for step 2, Π\Pi (pi)

Step 2 performs the following:

“Find a value function in the value function model hypothesis space Ω\Omega that optimizes the objective”

Varg minVΩ12V(s)(BV)(s)2V' \leftarrow \argmin_{V' \in \Omega} \frac{1}{2} \sum ||V'(s)-(BV)(s)||^2

So step 2 performs this model fitting operator Π\Pi that projects BVBV onto hypothesis space Ω\Omega, our iteration algorithm can now be described as

VΠBVV \leftarrow \Pi BV

BB is still a contraction with respect to \infin norm

BVBVˉγVVˉ||BV-B\bar{V}||_\infin \le \gamma||V-\bar{V}||_\infin

Π\Pi is a contraction w.r.t. the L2-norm

ΠVΠVˉ2VVˉ2||\Pi V- \Pi \bar{V}||^2 \le ||V - \bar{V} ||^2

However, ΠB\Pi B is not a contraction of any kind

So fitted value iteration does not converge in general and often does not converge in practice

Same with fitted Q-Iteration, Batch actor-critic
🧙🏽‍♂️
Bad theoretical properties, but works in practice lol

But why? Isn’t value function fitting just gradient descent?

We can actually turn this into a true gradient descent algorithm by also differentiating through the target values ⇒ the resulting method is called the residual algorithm; it has poor numerical properties & doesn’t work well in practice

Correlation Problem in Q-Learning

⚠️
Sequential states are strongly correlated, and the target value is always changing (chasing its own tail)

Think about a sequence

In the early 3 steps, the target value is kind of “overfit” to those 3 values

In the supervised setting of learning Q / Value functions, the data points seen by the target values are only local states

We can solve this by adding more batches at one time (by synchronized parallel Q-learning or asynchronous parallel Q-learning)

Or we use replay buffers (since fitted Q-learning is basically an off-policy method) ⇒ samples are no longer correlated, and multiple samples in the batch give a low-variance gradient

But where does the data come from:

  1. Need to periodically feed the replay buffer
⚠️
On full fitted Q-Iteration algorithm, we set ϕarg minϕ12iQϕ(si,ai)yi2\phi \leftarrow \argmin_{\phi} \frac{1}{2} \sum_i ||Q_\phi(s_i,a_i)-y_i||^2 But do we want it to actually converge to argmin if the sample has high variance?

Maybe just move one gradient step instead

ϕϕαidQϕdϕ(si,ai)(Qϕ(si,ai)yi)\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-y_i)

Usually ⇒ K[1,4]K \in [1, 4], N[10000,50000]N \in [10000, 50000]

Target network makes sure that we’re not trying to hit a moving target ⇒ and now it looks more like supervised regression

🔥
Popular Alternative of target network update: ϕτϕ+(1τ)ϕ\phi ' \leftarrow \tau \phi ' + (1-\tau) \phi τ=0.999\tau = 0.999 works well
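The Polyak-averaged target update in the callout above is a one-liner per parameter. A minimal sketch, with plain lists standing in for the networks' flattened parameters:

```python
def polyak_update(target_params, online_params, tau=0.999):
    """phi' <- tau * phi' + (1 - tau) * phi, applied elementwise.
    With tau close to 1, the target network trails the online network slowly."""
    return [tau * tp + (1.0 - tau) * op
            for tp, op in zip(target_params, online_params)]
```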

Question: regarding the chasing-its-own-tail relationship, is there a typical scenario where the Q function chases its tail and shows bad behavior?

Classic Deep Q-Learning Algorithm (DQN)

A special case(with specific parameters) of Q-learning with replay buffer and target network

A general view of Q-Learning

Online Q-learning (last lecture): evict immediately, process 1, process 2, and process 3 all run at the same speed.

DQN: Process 1 and process 3 run at the same speed, process 2 is slow.

Fitted Q-iteration: process 3 in the inner loop of process 2, which is in the inner loop of process 1

Overestimation in Q-Learning and Double Q-Learning

Q functions are not accurate (systematically much larger than the true values)
🔥
Why does the Q-function predict systematically larger values than the true values?

Because target value

yj=rj+γmaxajundefinedthis last term is the problemQϕ(sj,aj)y_j = r_j + \gamma \underbrace{\max_{a_j '}}_{\mathclap{\text{this last term is the problem}}} Q_{\phi '}(s_j ', a_j ')

Imagine two random variables X1,X2X_1, X_2

We can prove:

E[max(X1,X2)]max(E[X1],E[X2])\mathbb{E}[\max(X_1,X_2)] \ge \max(\mathbb{E}[X_1],\mathbb{E}[X_2])
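This inequality is easy to see empirically: take two unbiased but noisy Q estimates of actions whose true values are both 0, and average the max over many draws. A minimal sketch (standard-normal noise is an illustrative assumption):

```python
import random

rng = random.Random(0)
n = 100_000
# max over two noisy zero-mean estimates is systematically optimistic:
# E[max(X1, X2)] = 1/sqrt(pi) ~ 0.56, while max(E[X1], E[X2]) = 0.
avg_max = sum(max(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)) / n
```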

But:

Qϕ(s,a)Q_{\phi '}(s',a') is not perfect, it is “noisy” ⇒ We are always selecting the positive errors while training using the max\max

Note that

maxaQϕ(s,a)=Qϕ(s,arg maxaQϕ(s,a)undefinedaction selected according to Qϕ)\max_{a'}Q_{\phi '}(s',a')=Q_{\phi '}(s',\underbrace{\argmax_{a'} Q_{\phi '}(s',a')}_{\mathclap{\text{action selected according to $Q_{\phi '}$}}})

So use “double Q-Learning” ⇒ use two networks

Update ϕA\phi A by using ϕB\phi B as a target network and ϕA\phi A as an action selector, opposite with ϕB\phi B update.

QϕA(s,a)r+γQϕB(s,arg maxaQϕA(s,a))Q_{\phi A}(s,a) \leftarrow r+\gamma Q_{\phi B}(s',\argmax_{a'} Q_{\phi A}(s',a'))
Q_{\phi B}(s,a) \leftarrow r + \gamma Q_{\phi A}(s', \argmax_{a'} Q_{\phi B}(s',a'))
Intuition: Choosing the “optimal” (but due to noise) action in QϕAQ_{\phi A} will cause the noise of the same action in QϕBQ_{\phi B} to be added into the result, therefore the “overestimation” effect is mitigated

In practice, instead of having two Q functions, use the two Q functions that we already have.

We already have ϕ\phi and ϕ\phi '

In standard Q-learning:

y=r+γQϕ(s,arg maxaQϕ(s,a))y = r+\gamma Q_{\phi '}(s', \argmax_{a'} Q_{\phi '}(s',a'))

Double Q-Learning:

y=r+γQϕ(s,arg maxaQϕ(s,a))y = r+\gamma Q_{\phi '}(s',\argmax_{a'} Q_{\phi}(s',a'))
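The difference between the two targets is only in which network selects the action. A minimal sketch for a discrete action space, with lists of Q values standing in for the networks' outputs at s':

```python
def standard_target(r, q_next_target, gamma):
    """y = r + gamma * max_a' Q_phi'(s', a'): the same net selects and evaluates."""
    return r + gamma * max(q_next_target)

def double_q_target(r, q_next_online, q_next_target, gamma):
    """y = r + gamma * Q_phi'(s', argmax_a' Q_phi(s', a')): selection decoupled
    from evaluation, which mitigates the overestimation."""
    a_star = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return r + gamma * q_next_target[a_star]
```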

Multi-step returns

Q-learning target

yj,t=rj,t+γmaxaj,t+1Qϕ(sj,t+1,aj,t+1)y_{j,t} = r_{j,t} + \gamma \max_{a_{j,t+1}} Q_{\phi '}(s_{j,t+1},a_{j,t+1})

Early on in training, the term \gamma \max_{a_{j,t+1}} Q_{\phi '}(s_{j,t+1},a_{j,t+1}) only gives us random noise because the network is bad, so rj,tr_{j,t} is the most important term. But later in training, when QϕQ_{\phi'} gets better and better, this term dominates (since it aggregates many reward terms rather than just a single-step reward) and becomes more important.

So we can use something like the “Monte Carlo” sum of rewards from actor-critic algorithms

y_{j,t} = \sum_{t'=t}^{t+N-1} \gamma^{t'-t}r_{j,t'} + \gamma^N \max_{a_{j,t+N}} Q_{\phi '}(s_{j,t+N},a_{j,t+N})
  1. Higher variance
  1. but lower bias
    1. Faster learning especially early on
  1. but only correct when on-policy learning
    1. In the second step it actually matters what action you take because the exploration policy might be different from current policy
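The N-step target is a direct translation of the formula. A minimal sketch, assuming rewards holds the N rewards r_t … r_{t+N-1} and q_next_target holds the target network's Q values at s_{t+N}:

```python
def n_step_target(rewards, q_next_target, gamma, n):
    """y_t = sum_{t'=t}^{t+N-1} gamma^(t'-t) * r_t'  +  gamma^N * max_a' Q_phi'(s_{t+N}, a')."""
    assert len(rewards) == n
    partial = sum(gamma ** k * rewards[k] for k in range(n))
    return partial + gamma ** n * max(q_next_target)
```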

How to fix the “on-policy” limit?

  1. Ignore the problem (😂 Yeah this is SOOOO CS)
    1. Often works very well
  1. Cut the trace
    1. dynamically choose N to get only on-policy data
    1. Works well when data mostly on-policy, and action space is small
  1. Importance sampling
    1. “Safe and efficient off-policy reinforcement learning. “ Munos et al. ‘16

Extending Q-Learning to continuous action space

Problem with continuous actions:

  1. Our policy has to select arg maxQϕ\argmax Q_\phi
  1. When evaluating the yjy_j for training QϕQ_\phi we also need to arg max\argmax
    1. Particularly problematic(since this runs in a loop!)

Options:

  1. Continuous Optimization Procedure
    1. Gradient based optimization (like SGD) is a bit slow in the inner loop
    1. Action space typically low-dimensional ⇒ So no need to do gradient optimization
  1. Stochastic optimization
    1. Uses the fact that action space is low-dimensional (easy to solve for optima)
  1. Use a function class that is easy to optimize
    1. Quadratic function?
      1. Gu, Lillicrap, Sutskever, L., ICML 2016
        1. Normalized Advantage Functions (NAF) architecture
    1. No change to algorithm
    1. Just as efficient as Q-Learning
    1. But loses representational power
  1. Learn an approximate maximizer
    1. Learn a function maximizer using neural nets
    1. DDPG (Lillicrap et al. ICLR 2016)
      1. “Deterministic” actor-critic (really approximate Q-Learning)
      1. maxaQϕ(s,a)=Qϕ(s,arg maxaQϕ(s,a))\max_{a} Q_{\phi} (s,a) = Q_\phi (s, \argmax_a Q_\phi (s,a))
        1. Idea is to train another network μθ(s)\mu_\theta(s) such that μθ(s)arg maxaQϕ(s,a)\mu_\theta(s) \approx \argmax_a Q_{\phi}(s,a)
        1. dQϕdθ=dadθdQϕda\frac{dQ_\phi}{d\theta} = \frac{da}{d\theta} \frac{d Q_{\phi}}{da}
    1. classic: NFQCA
    1. recent: TD3 and SAC

Stochastic Optimization

Simple Solution

maxaQ(s,a)max{Q(s,a1),,Q(s,aN)}\max_{a} Q(s,a) \approx \max \{ Q(s,a_1), \dots, Q(s,a_N) \}

Where (a1,,aN)(a_1, \dots, a_N) are sampled from some distribution (e.g. uniform)

  1. Dead simple
  1. Efficiently parallelizable
  1. But - not very accurate
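The simple sampled-max above fits in a few lines. A minimal sketch for a 1-D action space; the quadratic Q function is made up purely for illustration:

```python
import random

def approx_max_q(q_fn, low, high, n_samples=1000, rng=random):
    """max_a Q(s,a) approximated by the max over actions sampled uniformly from [low, high]."""
    actions = [rng.uniform(low, high) for _ in range(n_samples)]
    best = max(actions, key=q_fn)
    return best, q_fn(best)

# A made-up 1-D Q function for one fixed state, with its maximum at a = 0.3:
q = lambda a: -(a - 0.3) ** 2
best_a, best_q = approx_max_q(q, -1.0, 1.0, n_samples=2000, rng=random.Random(0))
```

All n_samples evaluations are independent, so on real hardware this maps well onto a single batched forward pass.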

More accurate (works OK for ≥ 40 dims):

  1. Cross-entropy method (CEM)
    1. simple iterative stochastic optimization
    1. samples actions from a distribution like the simple method, but then refits to the best samples and continues sampling in a smaller and smaller range
  1. CMA-ES
    1. substantially less simple iterative stochastic optimization
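The CEM refinement loop can be sketched in a few lines for a 1-D action. All hyperparameters here (population 64, 8 elites, 20 iterations, Gaussian sampling distribution) are illustrative assumptions, as is the quadratic objective:

```python
import random
import statistics

def cem_maximize(f, mean=0.0, std=1.0, iters=20, pop=64, elites=8, rng=random):
    """Cross-entropy method for a 1-D action: sample, keep the elites,
    refit the sampling distribution to them, repeat in a shrinking range."""
    for _ in range(iters):
        xs = [rng.gauss(mean, std) for _ in range(pop)]
        xs.sort(key=f, reverse=True)
        top = xs[:elites]
        mean = statistics.mean(top)
        std = statistics.stdev(top) + 1e-6  # keep a little exploration
    return mean

# Made-up objective with its maximum at a = 0.5:
best = cem_maximize(lambda a: -(a - 0.5) ** 2, rng=random.Random(0))
```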

Implementation Tips

Instability of Q Learning
Huber loss (interpolating between squared error and absolute value loss)
  1. Q-learning takes some care to stabilize
    1. Test on easy, reliable tasks first, make sure the implementation is correct
    1. Large replay buffers help improve stability
      1. Looks more like fitted Q-iteration
    1. Takes time, be patient - might be no better than random for a while
    1. Start with high exploration (epsilon) and gradually reduce
    1. Bellman error gradients can be big; clip gradients or use Huber loss
  1. Double Q-learning helps a lot in practice
    1. Simple & no downsides!
  1. N-step returns also help a lot
    1. but have some downsides (will systematically bias the objective)
  1. Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too
  1. Run multiple random seeds, it’s very inconsistent between runs

Suggested Readings

  1. Classic papers
    1. Watkins. (1989). Learning from delayed rewards: introduces Q-learning
    1. Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks
  1. Deep reinforcement learning Q-learning papers
    1. Lange, Riedmiller. (2010). Deep auto-encoder neural networks in reinforcement learning: early image-based Q-learning method using autoencoders to construct embeddings
    1. Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
    1. Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
    1. Lillicrap et al. (2016). Continuous control with deep reinforcement learning: continuous Q-learning with actor network for approximate maximization.
    1. Gu, Lillicrap, Sutskever, L. (2016). Continuous deep Q-learning with model-based acceleration: continuous Q-learning with action-quadratic value functions.
    1. Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.

Model-based Methods

🔥
Why model-based methods? Because if we know the transition dynamics f(st,at)f(s_t, a_t) (or p(st+1st,at)p(s_{t+1}|s_t,a_t) in the stochastic case) then we can use trajectory planning ⇒ so we can learn f(st,at)f(s_t,a_t) from data
📌
Model-based RL can be prone to errors from distribution mismatch ⇒ much like behavior cloning, because when the planned trajectory queries out-of-sample states, f()f(\cdot) is likely to be wrong

MPC (Model Predictive Control)

Performance Gap Issue

We want good exploration to gather more data to address the distribution shift, but with a not-so-good early model we cannot deploy good exploration techniques

Basically, early on the model (a function approximator) overfits, and the exploration policy gets stuck in a local minimum / maximum

Uncertainty Estimation

Uncertainty estimation helps us reduce performance gap.

Under this trick, the expectation of reward under high-variance prediction is very low, even though the mean is the same

However:

  1. We still need exploration to get better model
  1. Expected value is not the same as pessimistic value

But how to build it?

  1. Use NN?
    1. Cannot express uncertainty of unseen data accurately ⇒ when querying out-of-sample data, the uncertainty output is arbitrary
    1. “The model is certain about data, but we are not certain about the model”
  1. Estimate Model Uncertainty
    1. Estimate arg maxθlogP(θD)=arg maxθlogP(Dθ)\argmax_\theta \log \mathbb{P}(\theta|D) = \argmax_\theta \log \mathbb{P}(D|\theta) (equal up to a constant under a flat prior, by Bayes’ rule)
    1. Can we instead estimate the distribution P(θD)\mathbb{P}(\theta|D)?
      1. We can use this to get a posterior distribution over the next state
        1. p(st+1st,at,θ)p(θD)dθ\int p(s_{t+1}|s_t,a_t,\theta) p(\theta|D) d\theta
      1. Bayesian Neural Nets
        1. Every weight, instead of being a number, is a distribution
        1. But it’s difficult to join parameters into a single distribution
        1. So we do approximation
          1. p(θD)=ip(θiD)p(\theta|D) = \prod_i p(\theta_i|D)
            1. Very Crude approximation
          1. p(θiD)=N(μi,σi2)p(\theta_i|D) = N(\mu_i, \sigma_i^2)
            1. Instead of learning numerical value learn mean value and std/variance
          1. Blundell et al. Weight Uncertainty in Neural Networks
          1. Gal et al. Concrete Dropout
  1. Bootstrap Ensembles
    1. Train several different neural nets and bootstrap samples(DiD_i sample with replacement from DD) to feed to those networks
      1. So each network learns slightly differently
      1. In theory each NN learns well on their own training data but makes different mistakes outside of training data
    1. Then estimate p(θD)1Niδ(θi)p(\theta|D) \approx \frac{1}{N} \sum_i \delta(\theta_i)
    1. p(st+1st,at,θ)p(θD)dθ1Nip(st+1st,at,θi)\int p(s_{t+1}|s_t,a_t,\theta) p(\theta|D) d\theta \approx \frac{1}{N} \sum_i p(s_{t+1}|s_t,a_t,\theta_i)
    1. Very crude approximation
      1. Because the number of models is usually small(expensive)
      1. But works
      1. Resampling with replacement is usually unnecessary, because SGD and random initialization usually makes the models sufficiently independent
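The bootstrap-ensemble idea can be demonstrated with toy models: fit several copies on resampled data and use their disagreement as the uncertainty estimate. A minimal sketch where 1-D least-squares lines stand in for the dynamics networks and the data is made up; disagreement grows away from the training data, which is exactly the behavior a single NN lacks:

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b (a stand-in for a dynamics model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # noisy made-up "dynamics" data

# Bootstrap ensemble: each model is fit on a dataset resampled with replacement.
models = []
for _ in range(10):
    idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
    models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))

def ensemble_std(x):
    """Disagreement between ensemble members, used as the uncertainty estimate."""
    preds = [w * x + b for w, b in models]
    m = sum(preds) / len(preds)
    return (sum((p - m) ** 2 for p in preds) / len(preds)) ** 0.5
```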

With Complex Observations (images)

  1. High dimensionality with observation prediction model
    1. Redundancy
  1. Partial Observability

Can we separately learn p(otst)p(o_t | s_t), which is high-dimensional but not dynamic, and p(st+1st,at)p(s_{t+1}|s_t,a_t), which is low-dimensional but dynamic?

Now we need to maximize

maxϕ1Ni=1Nt=1TE[logpϕ(st+1st,i,at,i)+logpϕ(ot,ist,i)]\max_\phi \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \mathbb{E}[\log p_\phi(s_{t+1}|s_{t,i},a_{t,i}) + \log p_\phi (o_{t,i}|s_{t,i})]

Note that the expectation is conditioned on

(st,st+1)p(st,st+1o1:T,a1:T)(s_t,s_{t+1}) \sim p(s_t,s_{t+1}|o_{1:T},a_{1:T})

So can we learn an additional approximate posterior (encoder):

qψ(sto1:t,a1:t)q_\psi (s_t|o_{1:t},a_{1:t})

Also we have other choices:

  1. qψ(st,st+1o1:T,a1:T)q_\psi(s_t,s_{t+1}|o_{1:T},a_{1:T}) - full smoothing posterior
    1. most accurate
    1. most complicated to train
    1. Preferred when the state is less fully-observed
  1. qψ(stot)q_\psi (s_t | o_t)
    1. single-step encoder
    1. simplest but less accurate
    1. stochastic case requires variational inference
    1. Deterministic: qψ(stot)=δ(st=gψ(ot))q_\psi(s_t|o_t) = \delta(s_t = g_\psi(o_t))

To use uncertainty

Before uncertainty is introduced, we have (assuming st+1=f(st,at)s_{t+1} = f(s_t,a_t)):

J(a1,,aH)=t=1Hr(st,at)J(a_1,\dots,a_H) = \sum_{t=1}^H r(s_t,a_t)

Now(With bootstrap):

J(a_1,\dots,a_H) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^H r(s_{t,i},a_t)

Other options:

  1. Moment matching
  1. more complex posterior estimation with BNNs

Further Readings

  1. Classic
    1. Deisenroth et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
  1. Recent papers:
    1. Nagabandi et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning.
    1. Chua et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.
    1. Feinberg et al. Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning.
    1. Buckman et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion.

Can we just backprop through policy / dynamics model?

📌
Generally No!
  1. Similar parameter sensitivity problems as shooting methods
    1. But no longer have convenient second order LQR-like method, because policy parameters couple all the time steps
  1. Similar problems to training long RNNs with BPTT
    1. Vanishing and exploding gradients

Let’s compare policy gradient:

θJ(θ)1Ni=1Nt=1Tθlogπθ(ai,tsi,t)Q^i,tπ\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\hat{Q}_{i,t}^\pi

with backprop (pathwise) gradient:

\nabla_\theta J(\theta) = \sum_{t=1}^T \frac{\partial a_t}{\partial \theta} \frac{\partial s_{t+1}}{\partial a_t}\sum_{t'=t+1}^T \frac{\partial r_{t'}}{\partial s_{t'}}\prod_{t'' = t+2}^{t'} (\frac{\partial s_{t''}}{\partial a_{t''-1}}\frac{\partial a_{t''-1}}{\partial s_{t''-1}}+\frac{\partial s_{t''}}{\partial s_{t''-1}})

Policy gradient might be more stable (if enough samples are used) because it does not require multiplying many Jacobians
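A quick numerical illustration of why those Jacobian products are troublesome (toy random Jacobians; nothing here comes from the lecture):

```python
import numpy as np

def jacobian_product_norm(scale, T, dim=4, seed=0):
    """Norm of a product of T random Jacobians whose entries are scaled by `scale`."""
    rng = np.random.default_rng(seed)
    J = np.eye(dim)
    for _ in range(T):
        # each factor stands in for ds_{t+1}/ds_t along a trajectory
        J = J @ (scale * rng.standard_normal((dim, dim)) / np.sqrt(dim))
    return np.linalg.norm(J)

shrinking = jacobian_product_norm(0.5, T=50)   # contractive dynamics -> vanishing gradients
growing = jacobian_product_norm(2.0, T=50)     # expansive dynamics -> exploding gradients
assert shrinking < 1e-4 and growing > 1e4
```

The likelihood-ratio (policy) gradient avoids these products entirely, which is the stability argument above.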

Parmas et al. ’18: PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

If we use policy gradient to backprop through our learned dynamics model, there is still a problem (although generating samples no longer requires interacting with the physical world or a simulator)

Left: Open-loop planning with long rollouts; Middle: Open-loop planning with shorter rollouts; Right: Collect real-world data from current policy and do short rollouts planning with data sampled from real distribution

But this does not fully solve the problem: the point of model-based RL is sample efficiency, and doing short-rollout open-loop planning from sampled real-world states restricts how much we can change our policy per iteration.

Model-based RL with Policies

📌
Question is how to use the model’s predictions? fit Q functions or Value functions or what?
  1. Model-Based Acceleration (MBA)
    1. Gu et al. Continuous deep Q-learning with model-based acceleration. ‘16
  1. Model-Based Value Expansion (MVE)
    1. Feinberg et al. Model-based value expansion. ’18
  1. Model-Based Policy Optimization (MBPO)
    1. Janner et al. When to trust your model: model-based policy optimization. ‘19

Good Idea:

  1. Sample Efficient

Bad Idea:

  1. Additional Source of bias if model is not correct
    1. Reduce this by using ensemble methods
  1. If we haven’t seen a state before, we still cannot choose a valid action there

Multi-step Models and Successor Representations

📌
The job of the model is to evaluate the policy
\begin{split} V^\pi(s_t) &= \sum_{t'=t}^\infin \gamma^{t'-t} \mathbb{E}_{p(s_{t'}|s_t)} \mathbb{E}_{\pi(a_{t'}|s_{t'})}[r(s_{t'},a_{t'})] \\ &= \sum_{t'=t}^\infin \gamma^{t' - t} \mathbb{E}_{p(s_{t'}|s_t)}[r(s_{t'})] \\ &= \sum_{t' = t}^\infin \gamma^{t'-t} \sum_s p(s_{t'} = s|s_t)r(s) \\ &= \sum_s \underbrace{(\sum_{t'=t}^\infin \gamma^{t' - t} p(s_{t'} = s | s_t))}_{\mathclap{\text{Adding a $(1-\gamma)$ factor in front makes this $p_\pi(s_{future}=s|s_t)$}}} r(s) \end{split}
p_\pi(s_{future}=s|s_t) = (1-\gamma)\sum_{t'=t}^\infin \gamma^{t' - t} p(s_{t'} = s | s_t)

Therefore

Vπ(st)=11γspπ(sfuture=sst)r(s)undefinedμπ(st)rV^\pi(s_t) = \frac{1}{1-\gamma} \underbrace{\sum_s p_\pi(s_{future}=s|s_t) r(s)}_{\mathclap{\mu^\pi(s_t)^\top \vec{r}}}

Where

\mu_i^\pi(s_t) = p_\pi(s_{future} = s_i|s_t)

is the “successor representation”

Dayan Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993.

\begin{split} \mu_i^\pi(s_t) &= (1-\gamma) \sum_{t'=t}^\infin \gamma^{t'-t} p(s_{t'} = i | s_t) \\ &= (1-\gamma)\delta(s_t = i) + \gamma \mathbb{E}_{a_t \sim \pi(a_t|s_t), s_{t+1} \sim p(s_{t+1}|s_t,a_t)}[\mu_i^\pi(s_{t+1})] \end{split}

Hmm…looks like a Bellman backup with “reward” $r(s_t) = (1-\gamma) \delta(s_t = i)$
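This can be checked numerically in a tiny tabular MDP (a made-up 3-state chain): build $\mu^\pi$ in closed form and verify that $V^\pi(s_t) = \frac{1}{1-\gamma}\mu^\pi(s_t)^\top \vec{r}$:

```python
import numpy as np

gamma = 0.9
# hypothetical 3-state MDP under a fixed policy: row-stochastic transition matrix
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])

# successor representation: M[s, i] = (1 - gamma) * sum_k gamma^k P^k[s, i]
M = (1 - gamma) * np.linalg.inv(np.eye(3) - gamma * P)
V_from_sr = M @ r / (1 - gamma)          # V = mu^T r / (1 - gamma)
V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)
assert np.allclose(V_from_sr, V_exact)
assert np.allclose(M.sum(axis=1), 1.0)   # each row of M is a distribution over future states
```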

But issues:

  1. Not clear if learning successor representation is easier than model-free RL
  1. how to scale to large state spaces?
  1. How to extend to continuous state spaces?

To scale to large state spaces, use successor feature of sts_t

\psi_j^\pi(s_t) = \sum_s \mu_s^\pi(s_t)\phi_j(s) = \mu^\pi(s_t)^\top \vec{\phi}_j

If we can express the reward function as combination of features of state s

r(s)=jϕj(s)wj=ϕ(s)wr(s) = \sum_j \phi_j(s) w_j = \phi(s)^\top w

Turns out we can express the value function as

V^\pi(s_t) = \psi^\pi(s_t)^\top w = \sum_j \mu^\pi(s_t)^\top \vec{\phi}_j w_j = \mu^\pi(s_t)^\top \sum_j \vec{\phi}_j w_j = \mu^\pi(s_t)^\top \vec{r}

The dimensionality of ϕ\phi and ψ\psi would be a lot lower than the dimension of possible states

Can also construct a Q-function like version

ψjπ(st,at)=ϕj(st)+γEst+1p(st+1st,at),at+1π(at+1st+1)[ψjπ(st+1,at+1)]\psi_j^\pi(s_t,a_t) = \phi_j(s_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t,a_t), a_{t+1} \sim \pi(a_{t+1}|s_{t+1})}[\psi_j^\pi(s_{t+1},a_{t+1})]
Qπ(st,at)ψπ(st,at)wQ^\pi(s_t,a_t) \approx \psi^\pi(s_t,a_t)^\top w

(when r(s)=ϕ(s)wr(s) = \phi(s)^\top w)

Using Successor Features

  1. Recover a Q-function very quickly
    1. Train ψπ(st,at)\psi^\pi(s_t,a_t) via Bellman backups
    1. Get some reward samples {si,ri}\{s_i, r_i \}
    1. Get warg minwiϕ(si)wri2w \leftarrow \argmin_w \sum_i ||\phi(s_i)^\top w - r_i||^2
    1. Recover Qπ(st,at)ψπ(st,at)wQ^\pi(s_t,a_t) \approx \psi^\pi(s_t,a_t)^\top w
      1. π(s)=arg maxaψπ(s,a)w\pi’(s) = \argmax_a \psi^\pi(s,a)^\top w
        1. This is only equivalent to one step of policy iteration because Q is with respect to the actor
  1. Recover many Q-function
    1. Train ψπk(st,at)\psi^{\pi_k}(s_t,a_t) for many policies πk\pi_k via Bellman backups
    1. Get some reward samples {si,ri}\{s_i, r_i \}
    1. Get warg minwiϕ(si)wri2w \leftarrow \argmin_w \sum_i ||\phi(s_i)^\top w - r_i||^2
    1. Recover Qπk(st,at)ψπk(st,at)wQ^{\pi_k}(s_t,a_t) \approx \psi^{\pi_k}(s_t,a_t)^\top w for every πk\pi_k
      1. π(s)=arg maxamaxkψπk(s,a)w\pi’(s) = \argmax_a \max_k \psi^{\pi_k}(s,a)^\top w
        1. This is only equivalent to one step of policy iteration because Q is with respect to the actor
        1. Finds the highest-reward policy in each state
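The “recover a Q-function quickly” recipe is mostly a least-squares fit of $w$; a sketch with made-up features (a random $\psi$ stands in for Bellman-trained successor features):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_feat = 20, 4
phi = rng.standard_normal((n_states, n_feat))     # hypothetical state features phi(s)
w_true = np.array([1.0, -0.5, 0.3, 0.0])
rewards = phi @ w_true                            # r(s) = phi(s)^T w

# step: fit w <- argmin_w sum_i ||phi(s_i)^T w - r_i||^2
w_hat, *_ = np.linalg.lstsq(phi, rewards, rcond=None)
assert np.allclose(w_hat, w_true)

# given successor features psi (here: random stand-ins for Bellman-trained ones),
# a Q-function is recovered as a dot product -- no further TD learning needed
psi = rng.standard_normal((n_states, n_feat))
Q = psi @ w_hat
assert Q.shape == (n_states,)
```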

Continuous Successor Representations (C-Learning)

🧙🏽‍♂️
So can we just frame successor representation as classification?
pπ(F=1undefinedmeans sfuture is a future state under πst,at,sfuture)=pπ(sfuturest,at)pπ(sfuturest,at)+pπ(sfuture)p^\pi(\underbrace{F=1}_{\mathclap{\text{means $s_{future}$ is a future state under $\pi$}}}|s_t,a_t,s_{future}) = \frac{p^\pi(s_{future}|s_t,a_t)}{p^\pi(s_{future}|s_t,a_t)+p^\pi(s_{future})}

Positive set during training:

D+pπ(sfuturest,at)D_+ \sim p^\pi(s_{future}|s_t,a_t)

Negative set just pulled randomly.

To train (On-policy algorithm, can also derive an off-policy algorithm that is also very efficient):

  1. Sample spπ(s)s\sim p^\pi(s) ⇒ run policy, sample from trajectories
  1. Sample spπ(sfuturest,at)s \sim p^\pi(s_{future}|s_t,a_t) ⇒ pick positive future pairs
  1. Update pπ(F=1st,at,s)p^\pi(F=1|s_t,a_t,s) using SGD with cross entropy loss

Eysenbach, Salakhutdinov, Levine. C-Learning: Learning to Achieve Goals via Recursive Classification. 2020.

Exploration (Lec 13)

It’s hard for algorithms to know what the rules of the environment are (like the card game “Mao”)

  1. The only rule you may be told is this one
    1. Incur a penalty when you break a rule
    1. Can only discover rules through trial and error
    1. Rules don’t always make sense to you

So here is the definition of exploration problem:

  1. How can an agent discover high-reward strategies that require a temporally extended sequence of complex behaviors that, individually, are not rewarding?
  1. How can an agent decide whether to attempt new behaviors or continue to do best things it knows so far?

So can we derive an optimal exploration strategy:

Tractable (easy to analyze) - If it is easy to know whether the policy is optimal or not

  1. Multi-arm bandits (1-step, stateless RL) / Contextual bandits (1 step, but there’s a state)
    1. Can formalize exploration as POMDP (partially observed MDP: although each episode is a single step, performing actions expands our knowledge of the arms) identification
    1. policy learning is trivial even with POMDP
  1. small, finite MDPs
    1. Can frame as Bayesian model identification, reason explicitly about value of information
  1. Large or infinite MDPs
    1. Optimal methods don’t work
      1. Exploring with random actions: oscillate back and forth, might not go to a coherent or interesting place
    1. But can take inspiration from optimal methods in smaller settings
    1. use hacks

Bandits

The drosophila (fruit fly, a classic model organism) of exploration problems
A “one-armed bandit” is a slot machine
Multi-armed bandit ⇒ have to decide which machine to play; pulling an arm once and getting a low payout doesn’t mean choosing it was a bad decision

So the definition of the bandit:

Assume r(ai)pθi(ri)r(a_i) \sim p_{\theta_i}(r_i)

e.g. p(ri=1)=θip(r_i = 1) = \theta_i and p(ri=0)=1θip(r_i = 0) = 1 - \theta_i

Where (but you don’t know):

θip(θ)\theta_i \sim p(\theta)

This defines a POMDP with s=[θ1,,θn]s = [\theta_1, \dots, \theta_n]

Belief state is p^(θ1,,θn)\hat{p}(\theta_1, \dots, \theta_n)

Solving the POMDP yields the optimal exploration strategy

But this is an overkill - belief state is huge! ⇒ we can do very well with much simpler strategies

We will define regret as the difference from optimal policy at time step TT

Reg(T)=TE[r(a)]t=1Tr(at)Reg(T) = T\mathbb{E}[r(a^*)] - \sum_{t=1}^T r(a_t)

How do we beat bandit?

  1. Variety of relatively simple strategies
  1. Often can provide theoretical guarantees on regret
    1. Empirical performance may vary
  1. Exploration strategies for more complex MDP domains will be inspired by these strategies

Optimistic exploration Reg(T)O(logT)Reg(T) \in O(\log T)

  1. Keep track of average reward μ^a\hat{\mu}_a for each action a
  1. Exploitation: pick a=arg maxμ^aa = \argmax \hat{\mu}_a
  1. Optimistic Estimate: a=arg maxμ^a+Cσaa = \argmax \hat{\mu}_a + C \sigma_a
    1. “Try each arm until you are sure it’s not great”
    1. a=arg maxμ^a+2lnTN(a)a = \argmax \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}} where N(a)N(a) is number of times we have picked this action
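A minimal UCB1 sketch on a Bernoulli bandit (the arm probabilities are made up):

```python
import numpy as np

def ucb_bandit(true_means, T=5000, seed=0):
    """UCB1: pull each arm once, then pick argmax of mean + sqrt(2 ln t / N(a))."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    counts = np.ones(n)
    sums = np.array([rng.binomial(1, p) for p in true_means], dtype=float)
    for t in range(n, T):
        bonus = np.sqrt(2 * np.log(t + 1) / counts)   # optimism bonus shrinks with N(a)
        a = np.argmax(sums / counts + bonus)
        sums[a] += rng.binomial(1, true_means[a])
        counts[a] += 1
    return counts

counts = ucb_bandit([0.2, 0.5, 0.8])
assert counts.argmax() == 2      # the best arm gets pulled most often
```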

To put this in practice, use r+(s,a)=r(s,a)+B(N(s))r^+(s,a) = r(s,a) + B(N(s)) ⇒ easy to add in but hard to tune the weights of B(N(s))B(N(s)).

Problem is in continuous states ⇒ we never see same thing twice ⇒ but some states are more similar than others

Bellemare et al. “Unifying Count-Based Exploration …”

So we will

  1. Fit a density model pθ(s)p_\theta(s) or pθ(s,a)p_\theta(s,a)
  1. pθ(s)p_\theta(s) might be high even for a new ss if ss is similar to previous seen states
  1. But how can the distribution of pθp_\theta match the distribution of a count?

Notice that the true probability is P(s)=N(s)nP(s) = \frac{N(s)}{n}, and after seeing ss, P(s)=N(s)+1n+1P’(s) = \frac{N(s) + 1}{n+1}

How to get to N^(s)\hat{N}(s)?

pθ(si)=N^(si)n^,pθ(si)=N^(si)+1n^+1p_\theta(s_i) = \frac{\hat{N}(s_i)}{\hat{n}}, p_{\theta}'(s_i) = \frac{\hat{N}(s_i)+1}{\hat{n}+1}

Solving the equation, we find

\hat{N}(s_i) = \hat{n}p_\theta(s_i), \quad \hat{n} = \frac{1 - p_\theta'(s_i)}{p_\theta'(s_i)-p_\theta(s_i)}
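A quick numeric check of the pseudo-count recovery: solve $p_\theta(s_i)=\hat{N}(s_i)/\hat{n}$ and $p_\theta'(s_i)=(\hat{N}(s_i)+1)/(\hat{n}+1)$, using made-up true counts $N(s)=2$ out of $n=10$:

```python
def pseudo_count(p_before, p_after):
    """Solve p = N/n and p' = (N+1)/(n+1) for the pseudo-count N_hat."""
    n_hat = (1 - p_after) / (p_after - p_before)
    return n_hat * p_before

# sanity check against true counts: N(s) = 2 out of n = 10 observations,
# so the density model would report p = 2/10 before and p' = 3/11 after seeing s
p, p_next = 2 / 10, 3 / 11
assert abs(pseudo_count(p, p_next) - 2.0) < 1e-6
```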

Different classes of bonuses:

  1. UCB bonus B(N(s))=2lnnN(s)B(N(s)) = \sqrt{\frac{2 \ln n}{N(s)}}
  1. MBIE-EB (Strehl & Littman, 2008): B(N(s))=1N(s)B(N(s)) = \sqrt{\frac{1}{N(s)}}
  1. BEB (Kolter & Ng, 2009): B(N(s))=1N(s)B(N(s)) = \frac{1}{N(s)}

So $p_\theta$ needs to output densities, but doesn’t necessarily need to produce great samples

Opposite considerations from many popular generative models like GANs

Bellemare et al. “CTS” Model: Condition each pixel on its top-left neighborhood

Other models: Stochastic neural networks, compression length, EX2

Probability Matching / Posterior Sampling (Thompson Sampling)

Assume r(ai)pθi(ri)r(a_i) \sim p_{\theta_i} (r_i) ⇒ which defines a POMDP with s=[θ1,,θn]s = [\theta_1, \dots, \theta_n]

Belief state is p^(θ1,,θn)\hat{p}(\theta_1, \dots, \theta_n)

Idea:

  1. Sample θ1,,θnp^(θ1,,θn)\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)
  1. Pretend the model θ1,,θn\theta_1, \dots, \theta_n is correct
  1. Take the optimal action
  1. Update the model
Chapelle & Li, “An Empirical Evaluation of Thompson Sampling.”

⇒ Hard to analyze theoretically but can work very well empirically
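The four-step idea above, sketched for Bernoulli arms with Beta posteriors (a standard instantiation; the arm probabilities here are made up):

```python
import numpy as np

def thompson(true_means, T=2000, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a model, act optimally under it, update."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    wins, losses = np.ones(n), np.ones(n)        # Beta(1, 1) priors on each theta_i
    counts = np.zeros(n)
    for _ in range(T):
        theta = rng.beta(wins, losses)           # step 1: sample theta ~ p_hat
        a = int(np.argmax(theta))                # steps 2-3: pretend it's correct, act optimally
        r = rng.binomial(1, true_means[a])
        wins[a] += r                             # step 4: update the posterior
        losses[a] += 1 - r
        counts[a] += 1
    return counts

counts = thompson([0.2, 0.5, 0.8])
assert counts.argmax() == 2   # posterior concentrates and the best arm dominates
```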

Osband et al. “Deep Exploration via Bootstrapped DQN”
  1. What do we sample?
  1. how do we represent the distribution?

Bootstrap ensembles:

  1. Given a dataset DD, resample with replacement NN times to get D1,,DND_1, \dots, D_N
  1. train each model fθif_{\theta_i} on DiD_i
  1. To sample from p(θ)p(\theta), sample i[1,,N]i \in [1, \dots, N] and use fθif_{\theta_i}
Training $N$ big neural nets is expensive; can we avoid it? In the paper they trained one net with multiple heads ⇒ the outputs may then be correlated (undesirable), but they are often uncorrelated enough to be useful
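The resampling step itself is trivial; a sketch on a toy 1-D dataset:

```python
import numpy as np

def bootstrap_datasets(dataset, n_models, seed=0):
    """Resample the dataset with replacement once per ensemble member."""
    rng = np.random.default_rng(seed)
    n = len(dataset)
    return [dataset[rng.integers(0, n, size=n)] for _ in range(n_models)]

data = np.arange(100)
ensembles = bootstrap_datasets(data, n_models=5)
assert len(ensembles) == 5
assert all(len(d) == len(data) for d in ensembles)
# resampling with replacement leaves out roughly 1/e of the points per dataset
assert all(len(np.unique(d)) < len(data) for d in ensembles)
```

Each $f_{\theta_i}$ would then be trained on its own resampled $D_i$; sampling a model means sampling an index $i$.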
🧙🏽‍♂️
This might work better than epsilon-greedy because instead of oscillating back and forth (and perhaps never going anywhere interesting), we commit to a randomized but internally consistent strategy for an entire episode

Methods that use Information Gain

Bayesian Experimental Design

If we want to determine some latent variable $z$ ⇒ $z$ might be the optimal action or its value

Which action do we take?

Let H(p^(z))H(\hat{p}(z)) be the current entropy of our zz estimate

Let H(p^(z)y)H(\hat{p}(z)|y) be the entropy of our zz estimate after observation yy (yy might be r(a)r(a) in the case of RL)

The lower the entropy, the more precisely we know zz

IG(z,y)=Ey[H(p^(z))H(p^(z)y)]IG(z,y) = \mathbb{E}_y [H(\hat{p}(z)) - H(\hat{p}(z)|y)]

Typically this depends on the action, so we will use the notation $IG(z, y|a) = \mathbb{E}_y [H(\hat{p}(z)) - H(\hat{p}(z)|y) \mid a]$

🧙🏽‍♂️
One important design decision is what $y$ is in our problem and what $z$ is
Russo & Van Roy “Learning to Optimize via Information-Directed Sampling”
y=ra,z=θay = r_a, z = \theta_a

Observe reward and learn parameters of p^(ra)\hat{p}(r_a)

g(a)=IG(θa,raa)information gain of ag(a) = IG(\theta_a, r_a|a) \rightarrow \text{information gain of $a$}
Δ(a)=E[r(a)r(a)]expected suboptimality of a\Delta(a) = \mathbb{E}[r(a^*) - r(a)] \rightarrow \text{expected suboptimality of $a$}

We want to gain more information but we don’t want our policy to be suboptimal

So we will choose aa according to

arg minaΔ(a)2g(a)\argmin_a \frac{\Delta(a)^2}{g(a)}
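The selection rule is just an argmin over arms; a toy sketch with made-up $\Delta(a)$ and $g(a)$ values:

```python
import numpy as np

def information_directed_action(deltas, gains):
    """Pick a = argmin Delta(a)^2 / g(a): lowest regret per unit of information."""
    deltas, gains = np.asarray(deltas, float), np.asarray(gains, float)
    return int(np.argmin(deltas ** 2 / gains))

# hypothetical arms: arm 0 is near-optimal but uninformative,
# arm 1 is slightly worse yet much more informative, arm 2 is just bad
assert information_directed_action(deltas=[0.10, 0.15, 0.9], gains=[0.01, 0.5, 0.2]) == 1
```

Note the rule tolerates a bit of suboptimality (arm 1) when the information gained is large enough.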

Houthooft et al. “VIME”

Question:

  1. Information gain about what?
    1. Reward r(s,a)r(s,a)
      1. Not useful if reward is sparse
    1. State density p(s)p(s)
      1. A bit strange, but somewhat makes sense!
    1. Dynamics p(ss,a)p(s’|s,a)
      1. Good proxy for learning the MDP, though still heuristic

Generally they are intractable to use exactly, regardless of what is being estimated

A few approximations:

  1. Prediction Gain (Schmidhuber ‘91, Bellemare ‘16)
    1. \log p_{\theta'}(s) - \log p_\theta(s)
    1. If density changed a lot, then the state is novel
  1. Variational Inference (Houthooft et al. “VIME”)
    1. IG can be equivalently written as DKL(p(zy),p(z))D_{KL}(p(z|y), p(z))
    1. Learn about transitions pθ(st+1st,at):z=θp_\theta(s_{t+1}|s_t,a_t): z = \theta
    1. y=(st,at,st+1)y = (s_t, a_t, s_{t+1})
    1. Intuition: a transition is more informative if it causes belief over θ\theta to change
    1. Idea:
      1. Use variational inference to estimate q(θϕ)p(θh)q(\theta|\phi) \approx p(\theta|h)
        1. Specifically optimize variational lower bound DKL(q(θϕ),p(hθ)p(θ))D_{KL}(q(\theta|\phi), p(h|\theta) p(\theta))
        1. Represent q(θϕ)q(\theta|\phi) as product of independent Gaussian parameter distributions with mean ϕ\phi
          1. p(θD)=ip(θiD)p(\theta|D) = \prod_i p(\theta_i |D)
          1. p(θiD)=N(μi,σi2undefinedϕ)p(\theta_i|D) = N(\underbrace{\mu_i, \sigma_i^2}_{\mathclap{\phi}})
        1. See Blundell et al. “Weight Uncertainty in Neural Networks”
      1. Given new transition (s,a,s)(s,a,s’), update ϕ\phi to get ϕ\phi'
      1. use DKL(q(θϕ),q(θϕ))D_{KL}(q(\theta|\phi’), q(\theta|\phi)) as approximate bonus

Counting with Hashes

What if we still count states, but compress them into hashes and count them

  1. Compress ss into k-bit code via ϕ(s)\phi(s) then count N(ϕ(s))N(\phi(s))
  1. shorter codes = more hash collisions
  1. similar states get the same hash? Maybe
    1. Depends on model we choose
    1. Using an autoencoder for the representation improves the odds of this
Tang et al. “#Exploration: A Study of Count-Based Exploration”
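A SimHash-style sketch of the counting scheme (random sign projections as $\phi(s)$; the dimensions and code length are arbitrary):

```python
import numpy as np

class HashCounter:
    """Count continuous states by hashing them to k-bit codes via random projections."""
    def __init__(self, state_dim, k_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k_bits, state_dim))   # SimHash-style projection
        self.counts = {}

    def visit(self, s):
        code = tuple((self.A @ s > 0).astype(int))          # phi(s): k-bit sign code
        self.counts[code] = self.counts.get(code, 0) + 1
        return self.counts[code]                            # N(phi(s))

counter = HashCounter(state_dim=16)
s = np.ones(16)
counter.visit(s)
assert counter.visit(s + 1e-6) == 2   # near-identical states collide -> shared count
```

Shorter codes give more collisions (coarser counts); a learned encoding such as an autoencoder can replace the random projection.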

Implicit Density Modeling with Exemplar Models

Fu et al. “EX2: Exploration with Exemplar Models…”

We need $p_\theta(s)$ to be able to output densities, but it doesn’t necessarily need to produce great samples ⇒ can we explicitly compare the new state to past states?

🧙🏽‍♂️
Intuition: The state is novel if it is easy to distinguish from all previous seen states by a classifier

For each observed state ss, fit a classifier to classify that state against all past states DD and use classifier error to obtain density

pθ(s)=1Ds(s)Ds(s)p_\theta(s) = \frac{1-D_s(s)}{D_s(s)}

where Ds(s)D_s(s) is the probability that the classifier assigns ss as a “positive”

In reality, every state is unique so we regularize the classifier

Heuristic estimation of counts via errors

We just need to tell if state is novel or not

Given buffer D={(si,ai)}D = \{(s_i,a_i)\}, fit f^θ(s,a)\hat{f}_\theta(s,a)

So we basically use the error between $\hat{f}_\theta(s,a)$ and the actual $f^*(s,a)$ as a bonus

ϵ(s,a)=f^θ(s,a)f(s,a)2\epsilon(s,a) = ||\hat{f}_\theta(s,a)-f^*(s,a)||^2

But what function should we use for ff^*?

  1. f(s,a)=sf^*(s,a) = s’
  1. f(s,a)f^*(s,a) as a neural network, but with parameters chosen randomly
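A sketch of the second option (a randomly-initialized target network, as in random network distillation); both nets here are tiny untrained MLPs and no training loop is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(in_dim, hid, out_dim, rng):
    """A tiny fixed two-layer MLP with random weights."""
    W1 = rng.standard_normal((hid, in_dim))
    W2 = rng.standard_normal((out_dim, hid))
    return lambda s: W2 @ np.tanh(W1 @ s)

target = make_net(4, 32, 8, rng)        # f*: random net, never trained
predictor = make_net(4, 32, 8, rng)     # f_hat: would be trained on visited states

def bonus(s):
    """Exploration bonus epsilon(s) = ||f_hat(s) - f*(s)||^2."""
    return float(np.sum((predictor(s) - target(s)) ** 2))

# before training the predictor, the bonus is nonzero everywhere; fitting it on D
# drives the error toward 0 on familiar states, leaving it high on novel ones
assert bonus(np.ones(4)) > 0.0
```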

We mentioned how $D_{KL}(q(\theta|\phi’), q(\theta|\phi))$ can be seen as information gain; it can also be viewed as the change in network parameters $\phi$

So if we forget about IG, there are many other ways to measure this

Stadie et al. 2015:

  1. encode image observations using auto-encoder
  1. build predictive model on auto-encoder latent states
  1. use model error as exploration bonus

Schmidhuber et al. (see, e.g., “Formal Theory of Creativity, Fun, and Intrinsic Motivation”):

  1. exploration bonus for model error
  1. exploration bonus for model gradient
  1. many other variations

Suggested Readings

  1. Schmidhuber. (1992). A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers.
  1. Stadie, Levine, Abbeel (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models.
  1. Osband, Blundell, Pritzel, Van Roy. (2016). Deep Exploration via Bootstrapped DQN.
  1. Houthooft, Chen, Duan, Schulman, De Turck, Abbeel. (2016). VIME: Variational Information Maximizing Exploration.
  1. Bellemare, Srinivasan, Ostroviski, Schaul, Saxton, Munos. (2016). Unifying Count-Based Exploration and Intrinsic Motivation.
  1. Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, De Turck, Abbeel. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
  1. Fu, Co-Reyes, Levine. (2017). EX2: Exploration with Exemplar Models for Deep Reinforcement Learning.

Unsupervised learning of diverse behaviors

What if we want to recover diverse behavior without any reward function at all?
One possible case is to describe a target goal and have the machine try to reach it
Nair*, Pong*, Bahl, Dalal, Lin, L. Visual Reinforcement Learning with Imagined Goals. ’18 Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. ‘19

xˉ\bar{x} may not be equal to xgx_g at all…but we will do some updates regardless.

$p_\theta(x|z)$ ⇒ latent-variable model (VAE) that decodes a sampled latent $z$ into a state $x$

However: this means the model tends to generate states that we’ve already seen (since we sample from $p(z)$) ⇒ we are not exploring very well

How do we reach diverse goals?

We will modify step 4 ⇒ instead of doing MLE: θ,ϕarg maxθ,ϕE[logp(xˉ)]\theta, \phi \leftarrow \argmax_{\theta, \phi} \mathbb{E}[\log p(\bar{x})], we will do a weighted MLE:

θ,ϕarg maxθ,ϕE[w(xˉ)logp(xˉ)]\theta,\phi \leftarrow \argmax_{\theta,\phi} \mathbb{E}[w(\bar{x}) \log p(\bar{x})]
We want to assign a greater weight to rarely seen actions, and since we’ve used a generator for pθ(xz)p_\theta(x|z), we can use
w(xˉ)=pθ(xˉ)α,α[1,0)w(\bar{x}) = p_\theta(\bar{x})^{\alpha}, \alpha \in [-1,0)

⇒ Note that for any α[1,0)\alpha \in [-1,0), the entropy H(pθ(x))H(p_\theta(x)) increases ⇒ proposing broader and broader goals
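A numeric sketch of this reweighting (hypothetical densities in place of the model’s $p_\theta(\bar{x})$):

```python
import numpy as np

def skew_weights(densities, alpha=-1.0):
    """Skew-Fit-style weights w(x) = p(x)^alpha, alpha in [-1, 0): rare states weigh more."""
    w = np.asarray(densities, float) ** alpha
    return w / w.sum()

p = np.array([0.7, 0.2, 0.1])        # made-up model densities of three goal states
w = skew_weights(p, alpha=-1.0)
assert w.argmax() == 2               # the rarest goal gets the largest weight
expected = np.array([1/0.7, 1/0.2, 1/0.1]) / (1/0.7 + 1/0.2 + 1/0.1)
assert np.allclose(w, expected)
```

Training on the reweighted distribution pushes the generative model toward a broader (higher-entropy) goal distribution.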

The RL formulation:

maxH(p(G))\max H(p(G))
  1. π(aS,G)\pi(a|S,G) trained to reach goal GG
    1. as π\pi gets better, the final state SS gets close to GGp(GS)p(G|S) becomes more deterministic

To maximize exploration,

maxH(p(G))H(p(GS))=maxI(S;G)\max H(p(G)) - H(p(G|S)) = \max I(S;G)

This is mainly for gathering data that leads to uniform density over the state distribution ⇒ doesn’t mean the policy is going randomly, what if we want a policy that goes randomly?

But is state-entropy really a good objective?
Eysenbach’s Theorem: At test time, an adversary will choose the worst goal $G$ ⇒ then the best goal distribution to use for training is $p(G) = \argmax_p H(p(G))$. Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration

State Marginal Matching (SMM)

Learn π(as)\pi(a|s) to minimize DKL(pπ(s),p(s))D_{KL}(p_\pi(s), p^*(s))

If we want to use intrinsic motivation, we have:

r~(s)=logp(s)logpπ(s)\tilde{r}(s) = \log p^*(s) - \log p_\pi(s)

Note that this reward, summed over the trajectory, matches exactly the state-marginal objective above ⇒ however, RL is not aware that $-\log p_\pi (s)$ itself depends on $\pi$ ⇒ tail-chasing problem

We can prove that this $\pi^*(a|s)$ is a Nash equilibrium of a two-player game

Learning Diverse Skills

π(as,zundefinedtask index)\pi(a|s,\underbrace{z}_{\mathclap{\text{task index}}})

  • Reaching diverse goals is not the same as performing diverse tasks
  • not all behaviors can be captured by goal-reaching
π(as)=arg maxπzEsπ(sz)[r(s,z)]\pi(a|s) = \argmax_{\pi} \sum_z \mathbb{E}_{s \sim \pi(s|z)} [r(s,z)]

We want states visited for task $z$ to be unlikely under other tasks $z' \ne z$; one way:

Use a classifier that gives a softmax output over the possible $z$’s

r(s,z)=logp(zs)r(s,z) = \log p(z|s)

Connection to mutual information:

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Gregor et al. Variational Intrinsic Control. 2016
I(z,s)=H(z)undefinedmaximized by using uniform prior p(z)H(zs)undefinedminimized by maximizing logp(zs)I(z,s) = \underbrace{H(z)}_{\text{maximized by using uniform prior $p(z)$}} - \underbrace{H(z|s)}_{\text{minimized by maximizing $\log p(z|s)$}}

Offline RL (Batch RL / fully off-policy RL)

Huge gap in DL and DRL ⇒ supervised learning has tons of data!

Formally:

D={(si,ai,si,ri)}sdπβ(s)aπβ(as)rr(s,a)D = \{ (s_i,a_i,s_i',r_i) \} \\ s \sim d^{\pi_\beta}(s) \\ a \sim \pi_\beta(a|s) \\ r \leftarrow r(s,a)
πβ\pi_\beta is the unknown policy that collected the data

Off-Policy Evaluation (OPE)

Given $D$, estimate the return $J(\pi)$ of some target policy $\pi$

Offline Reinforcement Learning

Given $D$, learn the best possible policy $\pi_\theta$ (the best policy whose good performance is supported by the data)

How is this even possible?

  1. Find the “good stuff” in a dataset full of good and bad behaviors
  1. Generalization: Good behavior in one place may suggest good behavior in another place
  1. “Stitching”: Parts of good behaviors (even mostly-bad trajectories can contain good segments) can be recombined
  • Bad intuition: It’s like imitation learning
    • Though it can be shown to be provably better than imitation learning even with optimal data, under some structural assumptions ⇒ Kumar, Hong, Singh, Levine. Should I Run Offline Reinforcement Learning or Behavioral Cloning?
  • Better Intuition: Get order from chaos
    • e.g. Get object from a drawer ⇒ Singh, Yu, Yang, Zhang, Kumar, Levine. COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning.

Behavior Cloning
“Stitching”

What can go wrong if we just use off-policy RL (on data collected by other agents)?

Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning of Vision-based Robotic Manipulation Skills

So it works in practice, but there seems to be a huge gap between the not-fine-tuned and fine-tuned results?

Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS ‘19
Q-function overfitting with an off-policy actor-critic algorithm
Fundamental Problem: Counterfactual queries

Online RL algorithms don’t have to handle this because they can simply try this action and see what happens ⇒ not possible for offline policies

Offline RL methods must somehow account for these unseen (”out-of-distribution”) actions, ideally in a safe way

Distribution Shift

In a supervised learning setting, we want to perform empirical risk minimization (ERM):

θarg minθExp(x),yp(yx)[(fθ(x)y)2]\theta \leftarrow \argmin_\theta \mathbb{E}_{x \sim p(x), y\sim p(y|x)} [(f_\theta(x)-y)^2]

Given some xx^*

\mathbb{E}_{x \sim p(x), y\sim p(y|x)} [(f_\theta(x)-y)^2] \text{ is low (since the distribution is the same as the training set)} \\ \mathbb{E}_{x \sim \bar{p}(x), y \sim p(y|x)} [(f_\theta(x)-y)^2] \text{ is generally not low if } \bar{p}(x) \ne p(x)

Yes, neural nets generalize well ⇒ but is it well enough?

If we pick xarg maxxfθ(x)x^* \leftarrow \argmax_x f_\theta(x)

Blue curve is the fitted and the green is the actual

When we choose argmax we are probably including a huge positive noise!

Q-Learning Actor-Critic with Distribution Shift

Q-objective:

minQE(s,a)πβ(s,a)[(Q(s,a)y(s,a))2]\min_{Q} \mathbb{E}_{(s,a) \sim \pi_\beta(s,a)}[(Q(s,a)-y(s,a))^2]

The function approximates:

Q(s,a) \leftarrow r(s,a) + \gamma \, \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]

The distribution shift problem kicks in when

πnewπβ\pi_{new} \ne \pi_\beta

And even worse,

πnew=arg maxπEaπ(as)[Q(s,a)]\pi_{new} = \argmax_\pi \mathbb{E}_{a \sim \pi(a|s)} [Q(s,a)]

Online RL Setting (Sample “optimal” and find error in our approximation)

Offline Setting (No way to find error in our approximation)

🧙🏽‍♂️
Existing Challenges with sampling error and function approximation error in standard RL become much more severe in offline RL

Batch RL via Important Sampling (Traditional Offline Algorithms)

Importance Sampling:

θJ(θ)1Ni=1Nπθ(τi)πβ(τi)undefinedimportance weightt=0Tθγtlogπθ(at,ist,i)Q^(st,i,at,i)\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\frac{\pi_\theta(\tau_i)}{\pi_\beta(\tau_i)}}_{\mathclap{\text{importance weight}}} \sum_{t=0}^T \nabla_\theta \gamma^t \log \pi_\theta(a_{t,i}|s_{t,i})\hat{Q}(s_{t,i},a_{t,i})

where the importance weight is exponential in $T$:

πθ(τ)πβ(τ)=p(s1)tp(st+1st,at)πθ(atst)p(s1)tp(st+1st,at)πβ(atst)O(eT)\frac{\pi_\theta(\tau)}{\pi_\beta(\tau)} = \frac{p(s_1) \prod_t p(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)}{p(s_1)\prod_t p(s_{t+1}|s_t,a_t) \pi_\beta(a_t|s_t)} \in O(e^T)

Can we fix this?

θJ(θ)1Ni=1Nt=0T(t=0t1πθ(at,ist,i)πβ(at,ist,i))undefinedaccounts for difference in probability of landing in st,iθγtlogπθ(at,ist,i)(t=tTπθ(at,ist,i)πβ(at,ist,i))undefinedaccounts for incorrect Q^(st,i,at,i)Q^(st,i,at,i)\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \underbrace{(\prod_{t'=0}^{t-1} \frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})})}_{\mathclap{\text{accounts for difference in probability of landing in $s_{t,i}$}}} \nabla_\theta \gamma^t \log \pi_\theta(a_{t,i}|s_{t,i}) \underbrace{(\prod_{t'=t}^T\frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})}) }_{\mathclap{\text{accounts for incorrect $\hat{Q}(s_{t,i},a_{t,i})$}}}\hat{Q}(s_{t,i},a_{t,i})

Classic advanced policy gradient methods simply drop the first product term ⇒ assuming the new policy $\pi_\theta$ is not too far from $\pi_\beta$

We can estimate

Q^(st,i,at,i)=Eπθ[t=tTγttrt]t=tTγttrt,i\hat{Q}(s_{t,i},a_{t,i}) = \mathbb{E}_{\pi_\theta}[\sum_{t'=t}^T \gamma^{t'-t} r_{t'}] \approx \sum_{t'=t}^T \gamma^{t'-t} r_{t',i}

We can simplify the portion (using the fact that future actions don’t affect current reward)

(\prod_{t'=t}^T\frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})})\hat{Q}(s_{t,i},a_{t,i}) = \sum_{t'=t}^T(\prod_{t''=t}^T\frac{\pi_\theta(a_{t'',i}|s_{t'',i})}{\pi_\beta(a_{t'',i}|s_{t'',i})})\gamma^{t'-t} r_{t',i} \approx \sum_{t'=t}^T(\prod_{t''=t}^{t'}\frac{\pi_\theta(a_{t'',i}|s_{t'',i})}{\pi_\beta(a_{t'',i}|s_{t'',i})})\gamma^{t'-t} r_{t',i}

But this is still exponential in TT

🧙🏽‍♂️
To avoid exponentially exploding importance weights, we must use value function estimation!

The Doubly Robust Estimator

Jiang, N. and Li, L. (2015). Doubly robust off-policy value evaluation for reinforcement learning
Vπθ(s0)t=0T(t=0tπθ(atst)πβ(atst))γtrt=t=0T(t=0tρt)γtrt=ρ0r0+ρ0γρ1r1+ρ0γρ1γρ2r2+=ρ0(r0+γ(ρ1(r1+...)))=VˉT\begin{split} V^{\pi_\theta}(s_0) &\approx \sum_{t=0}^T (\prod_{t'=0}^t \frac{\pi_\theta(a_{t'}|s_{t'})}{\pi_\beta(a_{t'}|s_{t'})})\gamma^t r_t \\ &= \sum_{t=0}^T(\prod_{t'=0}^t \rho_{t'}) \gamma^t r_t \\ &= \rho_0 r_0 + \rho_0\gamma\rho_1r_1 + \rho_0\gamma \rho_1 \gamma \rho_2 r_2 + \dots \\ &=\rho_0(r_0 + \gamma(\rho_1(r_1+...))) \\ &=\bar{V}^T \end{split}

We notice a recursion relationship:

VˉT+1t=ρt(rt+γVˉTt)\bar{V}^{T+1-t} = \rho_t(r_t + \gamma \bar{V}^{T-t})

Bandit case of doubly robust estimation:

VDR(s)=V^(s)+ρ(s,a)(rs,aQ^(s,a))V_{DR}(s) = \hat{V}(s) + \rho(s,a)(r_{s,a} - \hat{Q}(s,a))

Doubly Robust Estimation:

VˉDRT+1t=V^(st)+ρt(rt+γVˉDRTtQ^(st,at))\bar{V}_{DR}^{T+1-t} = \hat{V}(s_t) + \rho_t(r_t + \gamma \bar{V}_{DR}^{T-t} - \hat{Q}(s_t,a_t))
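The recursion is easy to implement backward over a trajectory; a sketch with made-up importance ratios and baselines:

```python
import numpy as np

def doubly_robust(rhos, rewards, V_hat, Q_hat, gamma=0.99):
    """Backward recursion V_DR^{T+1-t} = V_hat(s_t) + rho_t (r_t + gamma V_DR^{T-t} - Q_hat(s_t, a_t))."""
    v = 0.0
    for t in reversed(range(len(rewards))):
        v = V_hat[t] + rhos[t] * (rewards[t] + gamma * v - Q_hat[t])
    return v

# sanity check: with V_hat = Q_hat = 0 this reduces to plain per-decision IS
rhos, rewards = [1.0, 2.0, 0.5], [1.0, 1.0, 1.0]
zeros = [0.0, 0.0, 0.0]
plain_is = doubly_robust(rhos, rewards, zeros, zeros, gamma=1.0)
assert abs(plain_is - 1.0 * (1 + 2.0 * (1 + 0.5 * 1))) < 1e-9
```

The estimate stays unbiased if either the importance ratios or the value/Q baselines are correct, hence “doubly robust”.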

Marginalized Importance Sampling

In previous times we were trying to do importance sampling on action distributions ⇒ but it’s actually possible to do it on state distributions
🧙🏽‍♂️
Instead of using tπθ(atst)πβ(atst)\prod_t \frac{\pi_\theta(a_t|s_t)}{\pi_\beta(a_t|s_t)}, estimate w(s,a)=dπθ(s,a)dπβ(s,a)w(s,a) = \frac{d^{\pi_\theta}(s,a)}{d^{\pi_\beta}(s,a)}

How to determine w(s,a)w(s,a)?

Suggested Reading

Batch RL via Linear Fitted Value Functions

Assume we have a feature matrix

ΦRS×K\Phi \in \mathbb{R}^{|S| \times K}

where S|S| is the number of states, and KK is the number of features. Then in order to do model-based RL in feature space

  1. Estimate the reward
    1. Φwrr\Phi w_r \approx rwr=(ΦΦ)1Φrw_r = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} \vec{r}
  1. Estimate the transitions
    1. ΦPΦPπΦ\Phi P_{\Phi} \approx P^{\pi} \Phi
      1. PΦP_\Phi ⇒ estimated feature space transition matrix RK×K\in \mathbb{R}^{K \times K}
      1. Pϕ=(ΦΦ)1ΦPπΦP_\phi = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} P^{\pi} \Phi
    1. PπRS×SP^\pi \in \mathbb{R}^{|S| \times |S|} ⇒ Real transition matrix (on states), dependent on the policy
  1. Estimate the value function
    1. VπVΦπ=ΦwVV^{\pi} \approx V_{\Phi}^\pi = \Phi w_V
      1. Solving for VπV^\pi in terms of PπP^\pi and rr:
        1. V^\pi = r + \gamma P^\pi V^\pi
        1. (1γPπ)Vπ=r(1-\gamma P^\pi) V^\pi = r
        1. Vπ=(IγPπ)1rV^\pi = (I - \gamma P^\pi)^{-1}r
    1. w_V = (I - \gamma P_\Phi)^{-1} w_r = (\Phi^\top \Phi-\gamma \Phi^\top P^\pi \Phi)^{-1} \Phi^\top \vec{r} ⇒ Least-squares Temporal Difference (LSTD)
    1. But we don’t know PπP^\pi
    1. With samples introduced, we will change our feature matrix ΦRD×K\Phi \in \mathbb{R}^{|D| \times K}
    1. And now we are going to replace P^\pi \Phi with \Phi' such that each row \Phi'_i is the feature vector of the next state, \Phi_i' = \phi(s_i')
      1. However, if we change policy, we no longer have PπP^\pi
    1. Also replace ri=r(si,ai)\vec{r}_i = r(s_i,a_i)
    1. Everything else works exactly the same, except that we now have some sampling error
  1. Improve the policy
    1. π(s)Greedy(ΦwV)\pi’(s) \leftarrow Greedy(\Phi w_V)
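
The LSTD solution above reduces to a single linear solve. A minimal sketch (hypothetical names; the small ridge term is a numerical-stability assumption, not part of the notes):

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma, ridge=1e-8):
    """LSTD: w_V = (Phi^T Phi - gamma Phi^T Phi')^{-1} Phi^T r, where row i of
    phi_next holds the features of the sampled next state s_i'."""
    k = phi.shape[1]
    A = phi.T @ phi - gamma * phi.T @ phi_next
    b = phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(k), b)
```

With one-hot (tabular) features and every transition sampled once, this recovers the exact V^\pi = (I - \gamma P^\pi)^{-1} r.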

Least Squares Policy Iteration (LSPI)

🧙🏽‍♂️
Replace LSTD with LSTDQ - LSTD but for Q functions
wQ=(ΦΦγΦΦundefinedΦi=ϕ(si,π(si)))1Φrw_Q = (\Phi^\top \Phi-\gamma \Phi^\top \underbrace{\Phi'}_{\mathclap{\Phi'_i = \phi(s_i',\pi(s_i'))}})^{-1} \Phi^\top \vec{r}

LSPI:

  1. Compute wQw_Q for πk\pi_k
  1. πk+1(s)=arg maxaϕ(s,a)wQ\pi_{k+1}(s) = \argmax_a \phi(s,a)w_Q
  1. Set Φi=ϕ(si,πk+1(si))\Phi_i’ = \phi(s_i’, \pi_{k+1}(s_i’))
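
The LSPI loop above can be sketched as follows (a toy illustration with hypothetical helper names; `phi_fn` builds the feature vector \phi(s, a), and the ridge term is a stability assumption):

```python
import numpy as np

def lstdq(phi_sa, phi_next, rewards, gamma, ridge=1e-8):
    """LSTDQ: w_Q = (Phi^T Phi - gamma Phi^T Phi')^{-1} Phi^T r."""
    k = phi_sa.shape[1]
    A = phi_sa.T @ phi_sa - gamma * phi_sa.T @ phi_next
    return np.linalg.solve(A + ridge * np.eye(k), phi_sa.T @ rewards)

def lspi(phi_sa, s_next, rewards, phi_fn, n_actions, gamma, iters=30):
    """LSPI: alternate LSTDQ with greedy improvement, rebuilding Phi' each time."""
    w = np.zeros(phi_sa.shape[1])
    for _ in range(iters):
        # pi_{k+1}(s') = argmax_a phi(s', a) . w_Q
        greedy = [max(range(n_actions), key=lambda a: phi_fn(s, a) @ w)
                  for s in s_next]
        phi_next = np.stack([phi_fn(s, a) for s, a in zip(s_next, greedy)])
        w = lstdq(phi_sa, phi_next, rewards, gamma)
    return w
```

On a tiny two-state MDP with tabular features this converges to the optimal Q-values in a couple of iterations.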

Problem:

Policy Constraints

Very old idea; naively implementing this generally does not work well:
Q(s,a) \leftarrow r(s,a) + \gamma\mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]
πnew(as)=arg maxπEaπ(as)[Q(s,a)] s.t. DKL(π,πβ)ϵ\pi_{new}(a|s) = \argmax_\pi \mathbb{E}_{a \sim \pi(a|s)}[Q(s,a)] \text{ s.t. } D_{KL}(\pi,\pi_\beta) \le \epsilon

Problem: the KL constraint D_{KL}(\pi, \pi_\beta) forces \pi to stay close to \pi_\beta everywhere, even where \pi_\beta takes bad actions.

So maybe instead we want to pose a “support constraint”

\pi(a|s) > 0 \text{ only if } \pi_\beta(a|s) \ge \epsilon

How to implement?

Implementing Implicit Policy Constraint in practice
  1. Modify the actor objective
    1. Common objective θarg maxθEsD[Eaπθ(as)[Q(s,a)]]\theta \leftarrow \argmax_\theta \mathbb{E}_{s \sim D} [\mathbb{E}_{a \sim \pi_\theta(a|s)}[Q(s,a)]]
    1. DKL(π,πβ)=E[logπ(as)logπβ(as)]=Eπ[logπβ(as)]H(π)D_{KL}(\pi, \pi_\beta) = \mathbb{E}[\log \pi(a|s) - \log \pi_\beta (a|s)] = -\mathbb{E}_\pi[\log \pi_\beta(a|s)] - H(\pi)
    1. So we can incorporate the KL divergence into the actor objective
    1. Use Lagrange Multiplier
      1. θarg maxθEsD[Eaπθ(as)[Q(s,a)+λlogπβ(as)]+λH(π(as))]\theta \leftarrow \argmax_\theta \mathbb{E}_{s \sim D} [\mathbb{E}_{a \sim \pi_\theta(a|s)}[Q(s,a) + \lambda \log \pi_\beta (a|s)]+\lambda H(\pi (a|s))]
      1. In practice we either solve for the Lagrange multiplier or treat it as a hyperparameter and tune it
  1. Modify the reward function
    1. rˉ(s,a)=r(s,a)D(π,πβ)\bar{r}(s,a) = r(s,a) - D(\pi, \pi_\beta)
    1. Simple modification to directly penalize divergence
    1. Also accounts for future divergence (in later steps)
    1. Wu, Tucker, Nachum, Behavior Regularized Offline Reinforcement Learning. ‘19
  1. Implicit Policy Constraint Methods
    1. Solving for the Lagrangian condition of the KL-constrained problem (straightforward to show via duality), we get
    1. π(as)=1Z(s)πβ(as)exp(1λAπ(s,a))\pi^*(a|s) = \frac{1}{Z(s)} \pi_\beta (a|s) \exp(\frac{1}{\lambda} A^\pi (s,a))
    1. Peters et al. (REPS)
    1. Rawlik et al. (”psi-learning”)
    1. We can do this by
      1. Approximate via weighted max likelihood
      1. πnew(as)=arg maxπE(s,a)πβundefined(s,a)D[logπ(as)1Z(s)exp(1λAπold(s,a))undefinedw(s,a)]\pi_{new}(a|s) = \argmax_{\pi} \mathbb{E}_{\underbrace{(s,a) \sim \pi_\beta}_{(s,a) \sim D}} [\log \pi(a|s) \underbrace{\frac{1}{Z(s)} \exp(\frac{1}{\lambda} A^{\pi_{old}} (s,a)) }_{w(s,a)}]
      1. Imitating good actions “more”(determined by w(s,a)w(s,a)) than bad actions
    1. Problem: In order to get advantage values / target values we still need to query OOD actions
      1. If we choose λ\lambda correctly the constraints will be respected at convergence, but not necessarily at intermediate steps
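
The weighted-maximum-likelihood update can be sketched as below (discrete actions; the clipping and batch normalization of the weights are common practical choices, not part of the derivation, and the helper names are mine):

```python
import numpy as np

def awr_weights(advantages, lam):
    """w(s,a) proportional to exp(A(s,a) / lambda); normalizing over the batch
    absorbs Z(s) only approximately (a common simplification)."""
    logits = np.clip(advantages / lam, -20.0, 20.0)  # clip for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def weighted_nll(log_pi, advantages, lam):
    """Objective to minimize: -sum_i w_i log pi(a_i | s_i),
    i.e. imitate high-advantage actions 'more'."""
    return -(awr_weights(advantages, lam) * log_pi).sum()
```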
🧙🏽‍♂️
Can we avoid ALL OOD actions in the Q update?

Instead of doing

Q(s,a) \leftarrow r(s,a) + \gamma\mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]

We can do:

Q(s,a) \leftarrow r(s,a) + \gamma V(s')

But how do we fit this V function?

  1. Varg minV1Ni=1Nl(V(si),Q(si,ai))V \leftarrow \argmin_{V} \frac{1}{N} \sum_{i=1}^N l(V(s_i), Q(s_i, a_i))
    1. MSE Loss (V(si)Q(si,ai))2(V(s_i) - Q(s_i, a_i))^2
      1. Problem is ⇒ the actions come from the exploration policy πβ\pi_\beta not from πnew\pi_{new}
      1. Gives us Eaπβ[Q(s,a)]\mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]

Think about this:

Distribution visualization of states
  1. We’ve probably seen only a few (or none) of the actions in the “exact” same state
    1. But there might be very “similar” states where we did execute actions ⇒ states come not in discrete bins but as distributions
    1. Can we use those “similar states” as extra data points for us to perform generalization on?
    1. But what if we use a “quantile” / “expectile” ⇒ an expectile is like a quantile, but with a squared instead of absolute loss
      1. Upper expectile / quantile is the best policy supported by the data

Implicit Q-Learning

Expectile:

l2τ(x)={(1τ)x2if x>0τx2otherwisel_2^\tau (x) = \begin{cases} (1-\tau)x^2 &\text{if } x>0 \\ \tau x^2 &\text{otherwise} \end{cases}

Intuition:

  1. MSE penalizes positive and negative errors equally
  1. expectile penalizes positive and negative with different weights depending on τ[0,1]\tau \in [0,1]
Weights given to positive and negative errors in expectile regression. Taken from the IQL paper
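
The asymmetric loss, and its effect as \tau \to 1, can be illustrated numerically (a toy sketch; `scalar_expectile` is a hypothetical helper that fits a single expectile by gradient descent):

```python
import numpy as np

def expectile_loss(diff, tau):
    """l_2^tau(x) with x = V - Q as in the notes: weight (1 - tau) when x > 0,
    tau otherwise, so large tau pushes V toward an upper expectile of Q."""
    weight = np.where(diff > 0, 1.0 - tau, tau)
    return weight * diff ** 2

def scalar_expectile(samples, tau, lr=0.05, steps=5000):
    """Fit a scalar v by gradient descent on E[l_2^tau(v - x)] (illustration only)."""
    v = float(samples.mean())
    for _ in range(steps):
        diff = v - samples
        grad = 2.0 * np.where(diff > 0, 1.0 - tau, tau) * diff
        v -= lr * grad.mean()
    return v
```

At \tau = 0.5 this recovers the mean (ordinary MSE regression); as \tau \to 1 it approaches the maximum of the supported samples, which is exactly the behavior IQL exploits.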

Formally, it could be shown that if we use

Varg minVE(s,a)D[l2τ(V(s)Q(s,a))]V \leftarrow \argmin_{V} \mathbb{E}_{(s,a) \sim D} [l_2^\tau(V(s)-Q(s,a))]

In practice, Q(s,a)Q(s,a) would be a target network

Formally, we can prove that for \tau close enough to 1,

V(s)maxaΩ(s)Q(s,a)V(s) \leftarrow \max_{a \in \Omega(s)} Q(s,a)

Where

Ω(s)={a:πβ(as)ϵ}\Omega(s) = \{a: \pi_\beta(a|s) \ge \epsilon \}
Kostrikov, Nair, Levine. Offline Reinforcement Learning with Implicit Q-Learning. ‘21
📌
Oh yes, that’s my postdoc’s paper (as the first author)! This is absolutely an amazing paper and he is absolutely an amazing person, also this is one of the papers I read for entering into Sergey’s group. I hope you get to check out this paper!

Conservative Q-Learning (CQL)

Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning
Directly repair overestimated actions in the Q function

Intuition: Push down on areas that have erroneously high Q values

We can formally show that Q^πQπ\hat{Q}^\pi \le Q^\pi for large enough α\alpha

But with this objective, we’re pressing down on actually “good” actions as well ⇒ so add an additional term that pushes up on good samples supported by data

In practice, we add a regularizer R(μ)R(\mu) in LCQL(Q^π)L_{CQL}(\hat{Q}^\pi) so that the μ\mu doesn’t need to be calculated directly.

L_{CQL}(\hat{Q}^\pi) = \max_{\mu} \{ \alpha\mathbb{E}_{s\sim D, a\sim \mu(a|s)}[\hat{Q}^\pi(s,a)] - \alpha \mathbb{E}_{(s,a) \sim D}[\hat{Q}^\pi(s,a)] + R(\mu) \} + \mathbb{E}_{(s,a,s') \sim D}[(\hat{Q}^\pi(s,a) - (r(s,a)+\gamma\mathbb{E}_{a' \sim \pi}[\hat{Q}^\pi(s',a')]))^2]

Common choice: R(μ)=EsD[H(μ(s))]R(\mu) = \mathbb{E}_{s \sim D} [H(\mu(\cdot | s))]

⇒ Then \mu(a|s) \propto \exp[Q(s,a)] and \mathbb{E}_{a \sim \mu(a|s)}[Q(s,a)] + H(\mu(\cdot|s)) = \log \sum_a \exp(Q(s,a))
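
With that entropy regularizer, the first two CQL terms reduce to a logsumexp penalty on the batch; a minimal numpy sketch for discrete actions (hypothetical shapes and names):

```python
import numpy as np

def cql_penalty(q, data_actions, alpha):
    """alpha * E_D[ logsumexp_a Q(s,a) - Q(s, a_data) ], averaged over a batch.
    q: [batch, n_actions] Q-values; data_actions: [batch] action indices."""
    m = q.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(q - m).sum(axis=1))  # numerically stable logsumexp
    q_data = q[np.arange(q.shape[0]), data_actions]
    return alpha * (lse - q_data).mean()
```

Since logsumexp upper-bounds the max, the penalty is always non-negative when the data action is included, so it genuinely pushes Q down on out-of-distribution actions relative to in-distribution ones.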

Model-based Offline RL

What goes wrong when we cannot collect more data?

MOPO: Model-Based Offline Policy Optimization

Yu*, Thomas*, Yu, Ermon, Zou, Levine, Finn, Ma. MOPO: Model-Based Offline Policy Optimization. ‘20 MOReL : Model-Based Offline Reinforcement Learning. ’20 (concurrent)

“punish” the policy for exploiting

r~(s,a)=r(s,a)λu(s,a)\tilde{r}(s,a) = r(s,a) -\lambda u(s,a)

where u(s,a)u(s,a) is model uncertainty about the state and action

u(s,a) needs to be at least as large as the error in the model (according to some kind of divergence) ⇒ getting a good estimate is still an open problem

“Use ensemble as an error proxy for the uncertainty metric” ⇒ current method
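
A sketch of the penalized reward with an ensemble-disagreement proxy for u(s,a) (the per-dimension std across models is one simple choice of my own for illustration; MOPO itself derives u from the learned models' predicted variances):

```python
import numpy as np

def penalized_reward(r, ensemble_next_preds, lam):
    """r~(s,a) = r(s,a) - lam * u(s,a), with u(s,a) taken as the largest
    per-dimension std of the ensemble's next-state predictions.
    ensemble_next_preds: [n_models, batch, state_dim]; r: [batch]."""
    u = ensemble_next_preds.std(axis=0).max(axis=-1)  # disagreement per sample
    return r - lam * u
```

When all models agree the penalty vanishes and the true reward is used; disagreement (a proxy for model error) makes the state-action pair look worse to the policy.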

Theoretical analysis:

In particular, δδmin\forall \delta \ge \delta_{\min}

Some implications:

ηM(π^)ηM(πB)2λϵu(πB)\eta_M(\hat{\pi}) \ge \eta_M(\pi^{B})-2\lambda \epsilon_u(\pi^{B})
We can almost always do better than the behavior (exploration) policy (since \delta is very small for \pi^B)
ηM(π^)ηM(π)2λϵu(π)\eta_M(\hat{\pi}) \ge \eta_M(\pi^*)-2\lambda \epsilon_u(\pi^*)
Quantifies the optimality “gap” in terms of model error

COMBO: Conservative Model-based RL

Yu, Kumar, Rafailov, Rajeswaran, Levine, Finn. COMBO: Conservative Offline Model-Based Policy Optimization. 2021.
📌
Basic idea: just like CQL minimizes Q-value of policy actions, we can minimize Q-value of model state-action tuples

Intuition: if the model produces something that looks clearly different from real data, it’s easy for the Q-function to make it look bad

Trajectory Transformer

Janner, Li, Levine. Reinforcement Learning as One Big Sequence Modeling Problem. 2021.
Model drawn as an auto-regressive sequence model
  1. Train a joint state-action model
    1. p_\beta(\tau) = p_\beta(s_1,a_1, \dots, s_T, a_T)
    1. Intuitively we want to optimize for a plan for a sequence of actions that has a high probability under the data distribution
  1. Use a big expressive model (Transformer)
    1. Subscript in the model image just shows the “timestep, dimension” of state and action spaces
    1. Diagram is an Auto-regressive sequence model (LSTM)
    1. With transformer need a causal mask
      1. GPT-style model
    1. Because we are regressing on both state and action sequences we can have accurate predictions out to much longer horizons
  1. Control
    1. Beam search, but use tr(st,at)\sum_t r(s_t,a_t) instead of probability
      1. Given current sequence, sample next tokens from model
      1. Store top KK tokens with highest cumulative reward (but as we progress only select high-probability sequences)
    1. Other methods like MCTS works as well

This works because

Which Offline Algorithm to use?

Why offline RL

Open Problems

RL Theory

📌
Effective analysis is very hard in RL without strong assumptions Trick is to make assumptions that admit interesting conclusions without divorcing us (too much) from reality

What is the point:

  1. Prove that our RL algorithms will work perfectly every time
    1. Usually not possible with current deep RL methods, which are often not even guaranteed to converge
  1. Understand how errors are affected by problem parameters
    1. Do larger discounts work better than smaller ones?
    1. If we want half the error, do we need 2x the samples, 4x, or something else?
    1. Usually we use precise theory to get imprecise qualitative conclusions about how various factors influence the performance of RL algorithms under strong assumptions, and try to make the assumptions reasonable enough that these conclusions are likely to apply to real problems (but they are not guaranteed to apply)

⚠️
Don’t take someone seriously if they say their RL algorithm has “provable guarantees”: the assumptions are always unrealistic, and theory is at best a rough guide to what might happen

Exploration

Performance of RL is greatly complicated by exploration - how likely are we to find potentially sparse rewards?

Theoretical guarantees typically address worst case performance ⇒ but worst case exploration is extremely hard
🧙🏽‍♂️
Goal: Show that exploration method is Poly(S,A,1/(1γ))Poly(|S|, |A|, 1/(1-\gamma))

Basic Sample Complexity Analysis

RL Theory Textbook. Agarwal, Jiang, Kakade, Sun.
https://rltheorybook.github.io

“Oracle Exploration”: for every (s,a)(s,a), sample sP(ss,a)s’ \sim P(s’|s,a) N times

Simple “model-based” algorithm

  1. P^(ss,a)=#(s,a,s)N\hat{P}(s’|s,a) = \frac{\#(s,a,s’)}{N}
    1. Count number of transitions
  1. Given π\pi, use P^\hat{P} to estimate Q^π\hat{Q}^\pi
  1. We want to measure worst case performance
    1. How close is Q^π\hat{Q}^\pi to QπQ^\pi?
    1. How close is Q^\hat{Q}^*(optimal function learned under P^\hat{P}) to QQ^*(optimal function learned under real PP)?
    1. How good is the resulting policy?
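
The count-based model estimate and exact policy evaluation under it can be sketched as below (toy tabular sizes; the (s,a)-flattening convention is an implementation choice of mine):

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """P_hat(s'|s,a) = #(s,a,s') / N(s,a), counting over (s, a, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s2 in transitions:
        counts[s, a, s2] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    return counts / np.maximum(totals, 1.0)

def q_pi(P, r, pi, gamma):
    """Exact Q^pi under model P, solving (I - gamma P^pi) Q = r over flattened (s,a).
    P: [S, A, S], r: [S, A], pi: [S, A] action probabilities."""
    S, A = r.shape
    # P^pi[(s,a), (s',a')] = P(s'|s,a) * pi(a'|s')
    Ppi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    return np.linalg.solve(np.eye(S * A) - gamma * Ppi, r.reshape(S * A)).reshape(S, A)
```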

“How close is Q^π\hat{Q}^\pi to QπQ^\pi?”

\begin{split} ||Q^\pi(s,a) - \hat{Q}^\pi(s,a)||_\infin &\le \epsilon \\ \max_{s,a} |Q^\pi(s,a) - \hat{Q}^\pi(s,a)| &\le \epsilon \end{split}

“How close is Q^\hat{Q}^* to QQ^*?”

Q(s,a)Q^(s,a)ϵ||Q^*(s,a) - \hat{Q}^*(s,a)||_\infin \le \epsilon

“How good is the resulting policy?”

Q(s,a)Qπ^(s,a)ϵ||Q^*(s,a) - Q^{\hat{\pi}}(s,a)||_\infin \le \epsilon

Concentration Inequalities

How fast our estimate of random variable converges to the true underlying value (in terms of # of samples)

Hoeffding’s Inequality (for bounded random variables)

Suppose X1,,XnX_1, \dots, X_n are a sequence of independent, identically distributed (i.i.d.) random variables with mean μ\mu. Let Xnˉ=n1i=1nXi\bar{X_n} = n^{-1}\sum_{i=1}^n X_i. Suppose that Xi[b,b+]X_i \in [b_{-},b_{+}] with probability 11, then

P(Xˉnμ+ϵ)exp{2nϵ2(b+b)2}\mathbb{P}(\bar{X}_n \ge \mu + \epsilon) \le \exp \{-\frac{2n\epsilon^2}{(b_{+}-b_{-})^2}\}

Therefore if we estimate \mu with n samples, the probability we’re off by more than \epsilon is at most 2 \exp \{ -\frac{2n\epsilon^2}{(b_{+}-b_{-})^2} \} (since we can be off on both sides). So if we want this probability to be at most \delta:

δ2exp{2nϵ2(b+b)2}logδ22nϵ2(b+b)2ϵ2(b+b)22nlog2δϵb+b2nlog2δ\begin{split} \delta &\le 2\exp\{-\frac{2n\epsilon^2}{(b_{+}-b_{-})^2}\} \\ \log \frac{\delta}{2} &\le -\frac{2n\epsilon^2}{(b_{+}-b_{-})^2} \\ \epsilon^2 &\le \frac{(b_{+}-b_{-})^2}{2n} \log \frac{2}{\delta} \\ \epsilon &\le \frac{b_{+}-b_{-}}{\sqrt{2n}} \sqrt{\log \frac{2}{\delta}} \end{split}
⚠️
Error ϵ\epsilon scales as 1n\frac{1}{\sqrt{n}}
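
The bound above can be computed directly; this snippet (hypothetical helper name) checks the 1/\sqrt{n} scaling, i.e. halving \epsilon requires 4x the samples:

```python
import math

def hoeffding_eps(n, delta, b_minus=0.0, b_plus=1.0):
    """Two-sided Hoeffding radius: with probability >= 1 - delta, the empirical
    mean of n samples in [b-, b+] is within eps of the true mean."""
    return (b_plus - b_minus) / math.sqrt(2.0 * n) * math.sqrt(math.log(2.0 / delta))
```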

Concentration for Discrete Distributions

Let z be a discrete random variable that takes values in \{1, 2, \dots, d \}, distributed according to q. We can write q as a vector where \vec{q} = [\mathbb{P}(z=j)]_{j=1}^d. Assume we have N iid samples, and that our empirical estimate of \vec{q} is [\hat{q}]_j = \sum_{i=1}^N 1\{z_i=j\} / N

We have that ϵ>0\forall \epsilon > 0,

P(q^q21N+ϵ)exp{Nϵ2}\mathbb{P}(||\hat{\vec{q}}-\vec{q}||_2 \ge \frac{1}{\sqrt{N}} + \epsilon) \le \exp\{-N\epsilon^2\}

Which implies:

P(q^q1d(1N+ϵ))exp{Nϵ2}\mathbb{P}(||\hat{\vec{q}} - \vec{q}||_1 \ge \sqrt{d}(\frac{1}{\sqrt{N}}+\epsilon)) \le \exp\{-N\epsilon^2\}

Let

δ=P(q^q1d(1N+ϵ))\delta = \mathbb{P}(||\hat{\vec{q}} - \vec{q}||_1 \ge \sqrt{d}(\frac{1}{\sqrt{N}}+\epsilon))

Then

δexp{Nϵ2}ϵ1Nlog1δN1ϵ2log1δ\begin{split} \delta &\le \exp\{-N\epsilon^2\} \\ \epsilon &\le \frac{1}{\sqrt{N}} \sqrt{\log \frac{1}{\delta}} \\ N &\le \frac{1}{\epsilon^2} \log \frac{1}{\delta} \end{split}

Using concentration inequalities, we see that in the “Basic Sample Complexity Analysis”, the state transition estimation difference is

P^(ss,a)P(ss,a)1S(1/N+ϵ)SN+Slog(1/δ)NcSlog(1/δ)N\begin{split} ||\hat{P}(s'|s,a) - P(s'|s,a)||_1 &\le \sqrt{|S|} (1/\sqrt{N} + \epsilon) \\ &\le \sqrt{\frac{|S|}{N}} + \sqrt{\frac{|S|\log (1/\delta)}{N}} \\ &\le c\sqrt{\frac{|S|\log (1/\delta)}{N}} \end{split}

Now: Relate error in P^\hat{P} to error in Q^π\hat{Q}^\pi

Recall:

Qπ(s,a)=r(s,a)+γEsP(ss,a)[Vπ(s)]Qπ(s,a)=r(s,a)+γsP(ss,a)Vπ(s)\begin{split} Q^\pi(s,a) &= r(s,a) + \gamma \mathbb{E}_{s' \sim \mathbb{P}(s'|s,a)}[V^\pi(s')] \\ Q^\pi(s,a) &= r(s,a) + \gamma\sum_{s'} \mathbb{P}(s'|s,a) V^\pi(s') \end{split}

So if we write out the transition function P, the Q function Q^\pi, the reward function r, and the value function V^\pi as matrices:

Dims of each matrix
Qπ=r+γPVπQ^\pi = r + \gamma PV^\pi

Remember that VV function is just an expectation over the QQ function (a sum) ⇒ so we can also write it as an expectation (where ΠRS×(SA)\Pi \in \mathbb{R}^{|S| \times (|S||A|)} and is a probability matrix)

Vπ=ΠQπV^\pi = \Pi Q^\pi

With this, (and Pπ=PΠP^\pi = P \Pi)

Qπ=r+γPπQπQ^\pi = r + \gamma P^\pi Q^\pi

So

(IγPπ)Qπ=rQπ=(IγPπ)1r\begin{split} (I - \gamma P^\pi)Q^\pi &= r \\ Q^\pi &= (I-\gamma P^\pi)^{-1} r \end{split}

And it turns out the matrix (I - \gamma P^\pi) is always invertible

Simulation Lemma:

QπQ^π=γ(IγP^π)1undefinedEvaluation Operator: Turns reward function into Q function(PP^)undefinedDifference in model probabilityVπQ^\pi - \hat{Q}^\pi = \gamma \overbrace{(I-\gamma \hat{P}^\pi)^{-1}}^{\mathclap{\text{Evaluation Operator: Turns reward function into Q function}}} \underbrace{(P-\hat{P})}_{\mathclap{\text{Difference in model probability}}} V^\pi

Proof:

QπQ^π=Qπ(IγP^π)1r=(IγP^π)1(IγP^π)Qπ(IγP^π)1r=(IγP^π)1(IγP^π)Qπ(IγP^π)1(IγPπ)Qπ=(IγP^π)1[(IγP^π)(IγPπ)]Qπ=γ(IγP^π)1(PπP^π)Qπ=γ(IγP^π)1(PΠP^Π)Qπ=γ(IγP^π)1(PP^)ΠQπ=γ(IγP^π)1(PP^)Vπ\begin{split} Q^\pi - \hat{Q}^\pi &= Q^\pi - (I-\gamma \hat{P}^\pi)^{-1}r \\ &=(I-\gamma\hat{P}^\pi)^{-1}(I - \gamma \hat{P}^\pi)Q^\pi - (I-\gamma \hat{P}^\pi)^{-1} r \\ &=(I-\gamma\hat{P}^\pi)^{-1}(I - \gamma \hat{P}^\pi)Q^\pi - (I-\gamma \hat{P}^\pi)^{-1} (I-\gamma P^\pi) Q^\pi \\ &=(I-\gamma\hat{P}^\pi)^{-1} [(I-\gamma \hat{P}^\pi) - (I-\gamma P^\pi)] Q^\pi \\ &=\gamma(I-\gamma\hat{P}^\pi)^{-1} (P^\pi - \hat{P}^\pi)Q^\pi \\ &=\gamma(I-\gamma\hat{P}^\pi)^{-1} (P \Pi - \hat{P} \Pi)Q^\pi \\ &=\gamma(I-\gamma\hat{P}^\pi)^{-1} (P - \hat{P})\Pi Q^\pi \\ &=\gamma(I-\gamma\hat{P}^\pi)^{-1} (P- \hat{P})V^\pi \\ \end{split}
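
The simulation lemma can be checked numerically on a random single-action MDP, where \Pi = I and V^\pi = Q^\pi (a toy verification of my own, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(n):
    """Random row-stochastic matrix (rows sum to 1)."""
    m = rng.random((n, n))
    return m / m.sum(axis=1, keepdims=True)

def check_simulation_lemma(n=4, gamma=0.9):
    """Verify Q^pi - Q_hat^pi = gamma (I - gamma P_hat^pi)^{-1} (P - P_hat) V^pi
    with a single action, so V^pi = Q^pi and Pi is the identity."""
    P, P_hat = random_stochastic(n), random_stochastic(n)
    r = rng.random(n)
    Q = np.linalg.solve(np.eye(n) - gamma * P, r)
    Q_hat = np.linalg.solve(np.eye(n) - gamma * P_hat, r)
    rhs = gamma * np.linalg.solve(np.eye(n) - gamma * P_hat, (P - P_hat) @ Q)
    return np.allclose(Q - Q_hat, rhs)
```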

Another Lemma:

Given PπP^\pi and any vector vRSA\vec{v} \in \mathbb{R}^{|S||A|}, we have

(IγPπ)1vv1γ||(I-\gamma P^\pi)^{-1}\vec{v}||_\infin \le \frac{||\vec{v}||_\infin}{1-\gamma}

“Q function” corresponding to “reward” v\vec{v} is at most (1γ)1(1-\gamma)^{-1} times larger

Where does this (1γ)1(1-\gamma)^{-1} come from? Geometric series of t=0γtc=c1γ\sum_{t=0}^\infin \gamma^t c = \frac{c}{1-\gamma}

Let

\vec{w} = (I-\gamma P^\pi)^{-1}\vec{v}

Then

\begin{split} ||\vec{v}||_\infin &= ||(I-\gamma P^\pi)\vec{w}||_\infin \\ &\overbrace{\ge}^{\mathclap{\text{Triangle Inequality}}} ||\vec{w}||_\infin - \gamma ||P^\pi \vec{w}||_\infin \\ &\ge ||\vec{w}||_\infin - \gamma||\vec{w}||_\infin \\ &= (1-\gamma) ||\vec{w}||_\infin \end{split}

Putting the lemmas together,

||(I-\gamma P^\pi)^{-1}\vec{v}||_\infin \le \frac{||\vec{v}||_\infin}{1-\gamma}

Consider the special case \vec{v} = \gamma(P-\hat{P})V^\pi:

QπQ^π=γ(IγP^π)1(PP^)VπQπQ^π=γ(IγP^π)1(PP^)Vπγ1γ(PP^)Vπγ1γ(maxs,aP(s,a)P^(s,a)1)Vπ\begin{split} Q^\pi - \hat{Q}^\pi &= \gamma (I-\gamma \hat{P}^\pi)^{-1}(P-\hat{P})V^\pi \\ ||Q^\pi - \hat{Q}^\pi||_\infin &= ||\gamma(I-\gamma \hat{P}^\pi)^{-1}(P-\hat{P})V^\pi||_\infin \\ &\le \frac{\gamma}{1-\gamma} ||(P-\hat{P})V^\pi||_\infin \\ &\le \frac{\gamma}{1-\gamma}(\max_{s,a} ||P(\cdot |s,a) - \hat{P}(\cdot|s,a)||_1) ||V^\pi||_\infin \end{split}

We can bound Vπ||V^\pi||_\infin

t=0γtrt11γRmax\sum_{t=0}^\infin \gamma^t r_t \le \frac{1}{1-\gamma}R_{max}

With this bound,

QπQ^πγ(1γ)2Rmax(maxs,aP(s,a)P^(s,a)1)||Q^\pi-\hat{Q}^\pi||_\infin \le \frac{\gamma}{(1-\gamma)^2} R_{max} (\max_{s,a} ||P(\cdot |s,a) - \hat{P}(\cdot|s,a)||_1)

With the previous bound on

\begin{split} ||\hat{P}(\cdot|s,a) - P(\cdot|s,a)||_1 &\le \sqrt{\frac{|S|}{N}} + \sqrt{\frac{|S|\log (1/\delta)}{N}} \\ &\le c\sqrt{\frac{|S|\log (1/\delta)}{N}} \end{split}

We conclude:

QπQ^πϵ=γ(1γ)2Rmaxc2Slog(δ1)N||Q^\pi-\hat{Q}^\pi||_\infin \le \epsilon = \frac{\gamma}{(1-\gamma)^2} R_{max} c_2 \sqrt{\frac{|S| \log(\delta^{-1})}{N}}
⚠️
Error grows quadratically in the horizon (1γ)2(1-\gamma)^{-2}, each backup accumulates error

What about QQ^||Q^* - \hat{Q}^*||_\infin?

Supremum - least upper bound of a function; one useful identity:

supxf(x)supxg(x)supxf(x)g(x)|\sup_x f(x) - \sup_x g(x)| \le \sup_x |f(x)-g(x)|

And

QQ^=supπQπsupπQ^πsupπQπQ^πϵ||Q^*-\hat{Q}^*||_\infin = ||\sup_\pi Q^\pi - \sup_\pi \hat{Q}^\pi||_\infin \le \sup_\pi ||Q^\pi - \hat{Q}^\pi||_\infin \le \epsilon

What about QQπ^||Q^* - Q^{\hat{\pi}^*}||_\infin?

QQπ^=QQ^π^+Q^π^Qπ^QQ^π^undefinedQ^+Q^π^Qπ^undefinedSame Policy2ϵ\begin{split} ||Q^* - Q^{\hat{\pi}^*}||_\infin &= ||Q^* - \hat{Q}^{\hat{\pi}^*}+\hat{Q}^{\hat{\pi}^*} - Q^{\hat{\pi}^*}||_\infin \\ &\le ||Q^* - \underbrace{\hat{Q}^{\hat{\pi}^*}}_{\hat{Q}^*}||_\infin + ||\underbrace{\hat{Q}^{\hat{\pi}^*} - Q^{\hat{\pi}^*}}_{\mathclap{\text{Same Policy}}}||_\infin \\ &\le 2\epsilon \end{split}

Fixed Q-Iteration

Abstract model of exact Q-iteration

Qk+1TQk=r+γPmaxaQkQ_{k+1} \leftarrow TQ_k = r+\gamma P \max_{a} Q_k

Abstract model of approximate fitted Q-iteration:

Q^k+1arg minQ^Q^T^Q^k\hat{Q}_{k+1} \leftarrow \argmin_{\hat{Q}} ||\hat{Q} - \hat{T}\hat{Q}_k||

Where T^\hat{T} is the approximate bellman operator:

T^Q=r^undefinedr^(s,a)=1N(s,a)iδ((si,ai)=(s,a))ri+γP^undefinedP^(ss,a)=N(s,a,s)N(s,a)maxaQ\hat{T}Q = \underbrace{\hat{r}}_{\mathclap{\hat{r}(s,a) = \frac{1}{N(s,a)}\sum_i \delta((s_i,a_i) = (s,a)) r_i}} + \gamma \overbrace{\hat{P}}^{\mathclap{\hat{P}(s'|s,a) = \frac{N(s,a,s')}{N(s,a)}}} \max_a Q

Note: We will not be computing P^\hat{P} and r^\hat{r}, it would just be the effect of averaging together transitions in the data (having different gradients for different samples)

We will assume the infinity norm here because there will be no convergence if using the L2 norm

Errors come from

Sampling Error:

T^Q(s,a)TQ(s,a)=r^(s,a)r(s,a)+γ(EP^[maxaQ(s,a)]EP[maxaQ(s,a)])r^(s,a)r(s,a)+γEP^[maxaQ(s,a)]EP[maxaQ(s,a)]\begin{split} |\hat{T}Q(s,a) - TQ(s,a)| &= |\hat{r}(s,a)-r(s,a)+\gamma(\mathbb{E}_{\hat{P}}[\max_{a'} Q(s',a')] - \mathbb{E}_{P}[\max_{a'} Q(s',a')])| \\ &\le |\hat{r}(s,a) - r(s,a)| + \gamma {|\mathbb{E}_{\hat{P}}[\max_{a'}Q(s',a')]-\mathbb{E}_{P}[\max_{a'}Q(s',a')]|} \\ \end{split}

Note:

So

T^Q(s,a)TQ(s,a)2Rmaxlog(δ1)2N+cQlog(δ1)N|\hat{T}Q(s,a) -TQ(s,a)| \le 2R_{max} \sqrt{\frac{\log(\delta^{-1})}{2N}} + c||Q||_\infin \sqrt{\frac{\log(\delta^{-1})}{N}}

And

T^QTQ2Rmaxc1log(SAδ)2N+c2Qlog(Sδ)N||\hat{T}Q - TQ||_\infin \le 2R_{max} c_1 \sqrt{\frac{\log(\frac{|S||A|}{\delta})}{2N}} + c_2||Q||_\infin \sqrt{\frac{\log(\frac{|S|}{\delta})}{N}}

Approximation Error:

Assume error: ||\hat{Q}_{k+1} - T\hat{Q}_k||_\infin \le \epsilon_k

This is a strong assumption!

\begin{split} ||\hat{Q}_k - Q^*||_\infin &= ||\hat{Q}_k - T\hat{Q}_{k-1} + T\hat{Q}_{k-1}-Q^*||_\infin \\ &= ||(\hat{Q}_k - T\hat{Q}_{k-1}) + (T\hat{Q}_{k-1} - \underbrace{TQ^*}_{\mathclap{\text{Using the fact that $Q^*$ is a fixed point of $T$}}})||_\infin \\ &\le ||\hat{Q}_k - T\hat{Q}_{k-1}||_\infin + ||T\hat{Q}_{k-1} - TQ^*||_\infin \\ &\le \epsilon _{k-1} + \underbrace{\gamma ||\hat{Q}_{k-1}- Q^*||_\infin}_{\mathclap{\text{Using the fact that $T$ is a $\gamma$-contraction}}} \\ \text{Unroll the recursion,} \\ &\le \sum_{i=0}^{k-1} \gamma^i \epsilon_{k-i-1} + \gamma^k ||\hat{Q}_{0} - Q^*||_\infin \\ \end{split}

limkQ^kQi=0γimaxkϵk=11γϵ\lim_{k \to \infin} ||\hat{Q}_k - Q^*||_\infin \le \sum_{i=0}^{\infin} \gamma^i \max_{k} \epsilon_k = \frac{1}{1-\gamma}||\epsilon||_\infin

Putting it together:

Q^kTQ^k1=Q^kT^Q^k1+T^Q^k1TQ^k1Q^kT^Q^k1undefinedApproximation Error+T^Q^k1TQ^k1undefinedSampling Error\begin{split} ||\hat{Q}_k - T\hat{Q}_{k-1}||_\infin &= ||\hat{Q}_k - \hat{T}\hat{Q}_{k-1} + \hat{T}\hat{Q}_{k-1} - T\hat{Q}_{k-1}||_\infin \\ &\le \underbrace{||\hat{Q}_k - \hat{T}\hat{Q}_{k-1}||_\infin}_{\mathclap{\text{Approximation Error}}} + \underbrace{||\hat{T}\hat{Q}_{k-1} - T\hat{Q}_{k-1}||_\infin}_{\mathclap{\text{Sampling Error}}} \\ \end{split}

Note:

From previous proofs, Sampling Error,Approx ErrorO((1γ)1)\text{Sampling Error}, \text{Approx Error} \in O((1-\gamma)^{-1})

More advanced results can be derived with pp-norms under some distribution

Variational Inference

No RL content in this chapter, but strong connections to RL algorithms

Latent Variable Model

We have data points xx, we want to model a distribution of xx with the help of latent variable zz.

How do we train latent variable models?

Estimating the log-likelihood

🧙🏽‍♂️
Use Expected Log-likelihood
θarg maxθ1NiEzp(zxi)[logpθ(xi,z)]\theta \leftarrow \argmax_\theta \frac{1}{N} \sum_i \mathbb{E}_{z \sim p(z|x_i)}[\log p_\theta(x_i,z)]

Intuition:

logp(xi)=logzp(xiz)p(z)=logzp(xiz)p(z)qi(z)qi(z)=logEzqi(z)[p(xiz)p(z)qi(z)]\begin{split} \log p(x_i) &= \log \int_z p(x_i|z) p(z) \\ &= \log \int_z p(x_i|z)p(z)\frac{q_i(z)}{q_i(z)} \\ &=\log \mathbb{E}_{z \sim q_i(z)} [\frac{p(x_i|z)p(z)}{q_i(z)}] \end{split}

Note:

Jensen’s Inequality (for concave functions)

logE[y]E[logy]\log \mathbb{E}[y] \ge \mathbb{E}[\log y]

So

logp(xi)=logEzqi(z)[p(xiz)p(z)qi(z)]Ezqi(z)[logp(xiz)p(z)qi(z)]=Ezqi(z)[logp(xiz)+logp(z)]Ezqi(z)[logqi(z)]=Ezqi(z)[logp(xiz)+logp(z)]+H(qi)\begin{split} \log p(x_i) &=\log \mathbb{E}_{z \sim q_i(z)} [\frac{p(x_i|z)p(z)}{q_i(z)}] \\ &\ge \mathbb{E}_{z \sim q_i(z)} [\log \frac{p(x_i|z)p(z)}{q_i(z)}] \\ &\quad = \mathbb{E}_{z \sim q_i(z)}[\log p(x_i|z) + \log p(z)] - \mathbb{E}_{z \sim q_i(z)} [\log q_i(z)] \\ &\quad = \mathbb{E}_{z \sim q_i(z)}[\log p(x_i|z) + \log p(z)] + H(q_i) \end{split}

Note about entropy:

H(p)=Exp(x)[logp(x)]=xp(x)logp(x)dxH(p) = -\mathbb{E}_{x \sim p(x)}[\log p(x)] = -\int_x p(x)\log p(x) dx

Intuitions:

Also note about KL-Divergence:

D_{KL}(q ||p) = \mathbb{E}_{x \sim q(x)}[\log \frac{q(x)}{p(x)}] = \mathbb{E}_{x \sim q(x)} [\log q(x)] - \mathbb{E}_{x \sim q(x)}[\log p(x)] = -\mathbb{E}_{x \sim q(x)}[\log p(x)] - H(q)

Intuitions:

So the variational approximation

logp(xi)Ezqi(z)[logp(xiz)+logp(z)]+H(qi)undefinedLi(p,qi)\begin{split} \log p(x_i) &\ge \overbrace{\mathbb{E}_{z \sim q_i(z)}[\log p(x_i|z) + \log p(z)] + H(q_i)}^{L_i(p,q_i)} \\ \end{split}
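
The bound can be checked on a conjugate Gaussian model where both sides are available in closed form (a toy example: z \sim N(0,1), x|z \sim N(z, \sigma_x^2); the helper names are mine):

```python
import math

def elbo(x, mu_q, sig_q, sig_x):
    """Analytic L_i(p, q) = E_q[log p(x|z) + log p(z)] + H(q) for
    z ~ N(0,1), x|z ~ N(z, sig_x^2), q(z) = N(mu_q, sig_q^2)."""
    e_loglik = -0.5 * math.log(2 * math.pi * sig_x ** 2) \
               - ((x - mu_q) ** 2 + sig_q ** 2) / (2 * sig_x ** 2)
    e_logprior = -0.5 * math.log(2 * math.pi) - (mu_q ** 2 + sig_q ** 2) / 2
    entropy = 0.5 * math.log(2 * math.pi * math.e * sig_q ** 2)
    return e_loglik + e_logprior + entropy

def log_marginal(x, sig_x):
    """Exact log p(x): marginalizing out z gives x ~ N(0, 1 + sig_x^2)."""
    var = 1.0 + sig_x ** 2
    return -0.5 * math.log(2 * math.pi * var) - x ** 2 / (2 * var)
```

For any q the ELBO sits below log p(x), and it matches exactly when q is the true posterior, consistent with the KL decomposition derived next.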

So what makes a good qi(z)q_i(z)?

Why?

\begin{split} D_{KL}(q_i(z) || p(z|x_i)) &= \mathbb{E}_{z \sim q_i(z)}[\log \frac{q_i(z)}{p(z|x_i)}] = \mathbb{E}_{z \sim q_i(z)}[\log \frac{q_i(z) p(x_i)}{p(x_i,z)}] \\ &= - \mathbb{E}_{z \sim q_i(z)}[\log p(x_i|z) + \log p(z)] + \mathbb{E}_{z \sim q_i(z)}[\log q_i(z)] + \mathbb{E}_{z \sim q_i(z)}[\log p(x_i)] \\ &= - \mathbb{E}_{z \sim q_i(z)}[\log p(x_i|z) + \log p(z)] - H(q_i) + \log p(x_i) \\ &=- L_i(p,q_i) + \log p(x_i) \end{split}

Also:

\log p(x_i) = D_{KL}(q_i(z)||p(z|x_i)) + L_i(p,q_i)

In fact, minimizing the KL-divergence has the effect of tightening the bound!

But how to improve qiq_i?

Amortized Variational Inference

Li=Ezqϕ(zxi)[logpθ(xiz)+logp(z)]undefinedJ(ϕ)=Ezqϕ(zxi)[r(xi,z)]+H(qϕ(zxi))L_i = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z) + \log p(z)]}_{J(\phi) = \mathbb{E}_{z \sim q_\phi(z|x_i)}[r(x_i,z)]} + H(q_\phi(z|x_i))

We can directly calculate the derivative for the entropy term but the first term J(ϕ)J(\phi) seems a bit problematic.

But this format looks a lot like policy gradient!

J(ϕ)1Mjϕlogqϕ(zjxi)r(xi,zj)\nabla J(\phi) \approx \frac{1}{M} \sum_j \nabla_\phi \log q_\phi(z_j|x_i) r(x_i,z_j)

What’s wrong with this gradient?

The reparameterization trick

⚠️
In RL we cannot use this trick because we cannot calculate gradient through the transition dynamics, but with variational inference we can
qϕ(zx)=N(μϕ(x),σϕ(x))z=μϕ(x)+ϵσϕ(x),ϵN(0,1)q_\phi(z|x) = N(\mu_\phi(x), \sigma_\phi(x)) \\ z = \mu_\phi(x) + \epsilon \sigma_\phi(x), \epsilon \sim N(0,1)

This makes z a deterministic function of a random variable \epsilon that is independent of \phi

So now:

J(ϕ)=Ezqϕ(zxi)[r(xi,z)]=EϵN(0,1)[r(xi,μϕ(xi)+ϵσϕ(xi))]\begin{split} J(\phi) &= \mathbb{E}_{z \sim q_\phi(z|x_i)}[r(x_i,z)] \\ &=\mathbb{E}_{\epsilon \sim N(0,1)}[r(x_i,\mu_\phi(x_i)+\epsilon \sigma_\phi(x_i))] \end{split}

And now to estimate ϕJ(ϕ)\nabla_{\phi} J(\phi)

This is useful because now we are using the derivative of the rr function

Another Perspective of LiL_i

Li=Ezqϕ(zxi)[logpθ(xiz)+logp(z)]+H(qϕ(zxi))=Ezqϕ(zxi)[logpθ(xiz)]+Ezqϕ(zxi)[logp(z)]+H(qϕ(zxi))undefinedDKL(qϕ(zxi)p(z))=Ezqϕ(zxi)[logpθ(xiz)]DKL(qϕ(zxi)p(z))=EϵN(0,1)[logpθ(xiμϕ(xi)+ϵσϕ(xi))]DKL(qϕ(zxi)p(z))logpθ(xiμϕ(xi)+ϵσϕ(xi))DKL(qϕ(zxi)p(z))\begin{split} L_i &= \mathbb{E}_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z) + \log p(z)] + H(q_\phi(z|x_i)) \\ &=\mathbb{E}_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)] + \underbrace{\mathbb{E}_{z \sim q_\phi(z|x_i)}[\log p(z)] + H(q_\phi(z|x_i))}_{-D_{KL}(q_\phi(z|x_i)||p(z))} \\ &= \mathbb{E}_{z \sim q_\phi(z|x_i)}[\log p_\theta(x_i|z)]-D_{KL}(q_\phi(z|x_i)||p(z)) \\ &=\mathbb{E}_{\epsilon \sim N(0,1)}[\log p_\theta(x_i|\mu_\phi(x_i) + \epsilon \sigma_\phi(x_i))]-D_{KL}(q_\phi(z|x_i)||p(z)) \\ &\approx \log p_\theta(x_i | \mu_\phi(x_i) + \epsilon \sigma_\phi (x_i))-D_{KL}(q_\phi(z|x_i)||p(z)) \\ \end{split}

Reparameterization trick vs. Policy Gradient
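
A quick numeric comparison of the two estimators on r(z) = -z^2, where the exact gradient is -2\mu (a toy sketch with hypothetical names; in practice the reparameterized estimator has much lower variance):

```python
import numpy as np

def grad_mu_estimates(mu, sigma, n=200_000, seed=0):
    """Two estimators of d/dmu E_{z ~ N(mu, sigma^2)}[r(z)] for r(z) = -z^2.
    Score-function is the policy-gradient-style estimator; reparameterization
    differentiates r(mu + sigma * eps) directly."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    z = mu + sigma * eps
    r = -z ** 2
    g_score = (r * (z - mu) / sigma ** 2).mean()  # r(z) * d/dmu log q_phi(z)
    g_reparam = (-2.0 * z).mean()                 # dr/dz * dz/dmu, with dz/dmu = 1
    return g_score, g_reparam
```

Both converge to the true gradient, but the reparameterized estimate uses the derivative of r itself, which is why it is typically far less noisy per sample.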

Example Models

VAE Variational Autoencoder

maxθ,ϕ1Nilogpθ(xiμϕ(xi)+ϵσϕ(xi))DKL(qϕ(zxi)p(z))\max_{\theta, \phi} \frac{1}{N} \sum_i \log p_\theta(x_i|\mu_\phi(x_i) + \epsilon \sigma_\phi(x_i)) - D_{KL}(q_\phi(z|x_i)||p(z))

Conditional Models

Generates yy given x,zx, z

Li=Ezqϕ(zxi,yi)[logpθ(yixi,z)+logp(zxi)]+H(qϕ(zxi,yi))L_i = \mathbb{E}_{z \sim q_\phi(z|x_i, y_i)}[\log p_\theta(y_i|x_i,z) + \log p(z|x_i)]+H(q_\phi(z|x_i,y_i))